Closed Bug 708614 Opened 13 years ago Closed 13 years ago

RayTracer-Num and RayTracer-Num-V significantly slower in fr-float than in flashruntime-redux on core-2-duo

Categories: Tamarin Graveyard :: Virtual Machine, defect, P2 (x86, macOS)
Tracking: Not tracked
Status: RESOLVED FIXED, Q2 12 - Cyril
People: Reporter: lhansen, Assigned: lhansen
Attachments: 6 files, 2 obsolete files

Identical bytecode, compiled with the new ASC without -abcfuture and with the -swf switch without version=16.  Results are in frames per second (higher is better):

                  FRR       FR-FLOAT

RayTrace-Num      24          19
RayTrace-Num-V    36          25

In theory this kind of regression should not happen; indeed, since FR-FLOAT has vector optimizations, it should be faster than FRR.

Another curious thing is that in FR-FLOAT, the float code corresponding to RayTrace-Num-V, RayTrace-float-V, runs at 36 FPS easily.
Rebuilding the tests with an older ASC against older playerglobal/builtin files does not help.  Definitely looks like a VM problem, not a compiler problem (but I need to look at the dumps to be sure).
Machine is 2008 MacBook Pro / Core-2-Duo / 2.6GHz / 4GB.
Assignee: nobody → lhansen
(Very preliminary.)

Could look like there's a bug with relational operators, in this code:

  dist = closestApproach - Math.sqrt(halfCord2);
  if( dist < closestIntersectionDist )

we're looking at this kind of output:

    subd4 = subd addd15, Math_sqrt3 (in reg? -1)
    ...
    intToAtom1 = calli.none #intToAtom ( immi1/*0x1019030 core*/ ldi77 )
    compare1 = calli.all #compare ( intToAtom1 ldi79 ) (in reg? -1)

but dist and closestApproach and closestIntersectionDist and the output of Math.sqrt are all Number.

More investigation needed, just posting this in case it triggers memories (I could be looking at the wrong code).
Attached file Test case, not Player dependent (obsolete)
The issue in comment #3 turned out to be a bug in the test program: there is an untyped variable that is used as a loop limit (in all the programs).  Fixing that, we see that there's a speedup across the board, but the speedup is greater on FRR (36->47 FPS for Num-V) than on FR-FLOAT (25->29 FPS).  So there's clearly a performance drain here somewhere.

(The float cases speed up quite a bit too and for the float4 program we're up to 57 fps, from below 50 before.)

The untyped variable was there in the original program.  This illustrates that the compiler MUST have some sort of "-fast" mode where it warns about untyped variables, Array, and so on.
Attachment #580050 - Attachment is obsolete: true
Not really all that relevant, but worth keeping in mind:

The performance difference is not observable on my Mac Pro / 2 x 2.93GHz Quad-Core Xeon / 16 GB.

On this machine, RayTrace-Num reaches 31 fps and RayTrace-Num-V reaches 58fps, on both FRR and FR-FLOAT.  The Num-V results match the float-V results on FR-FLOAT.

Also on this machine the float4-V code is *much* slower than the float-V code (43 fps vs 58 fps), but that will be the subject of another bug.
In another experiment, the iterated Num-V kernel (the attached test case) has the same performance in tamarin-redux and tr-float, on both machines.

As FR-FLOAT is derived from FRR there should be limited room for differences in the two, apart from the VM and the build settings.
(In reply to Lars T Hansen from comment #7)

> The performance difference is not observable on my Mac Pro / 2 x 2.93GHz
> Quad-Core Xeon / 16 GB.
> 
> On this machine, RayTrace-Num reaches 31 fps and RayTrace-Num-V reaches
> 58fps, on both FRR and FR-FLOAT.  The Num-V results match the float-V
> results on FR-FLOAT.
> 
> Also on this machine the float4-V code is *much* slower than the float-V
> code (43 fps vs 58 fps), but that will be the subject of another bug.

NOTE: Those were 64-bit runs (because the "Open in 32-bit" setting does not follow the app when it's moved but is stored in the Finder somehow).

With 32-bit, the float4-V code is on par with the float-V code on Xeon (whew).

For 32-bit runs on this platform, the RayTracer-Num and RayTracer-Num-V results are again the same for the two players, but the RayTracer-Num results are better than in the 64-bit runs (41 fps vs 31 fps).  Of course it remains a mystery why it would be 10fps slower on a 64-bit build...

The others all peak at 58fps.  I don't know what maximum frame rate the player supports; I compiled the SWF for 100fps.
Summary: RayTracer-Num and RayTracer-Num-V significantly slower in fr-float than in flashruntime-redux → RayTracer-Num and RayTracer-Num-V significantly slower in fr-float than in flashruntime-redux on core-2-duo
Apart from constants and addresses the machine code generated for the kernel program is identical on the MacBook Pro and the Mac Pro (Debug-Debugger build with -Dnodebugger -Dverbose=jit, textual comparison of the output after replacing all hex constants).
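For anyone repeating the comparison, here is a minimal sketch of the "replace all hex constants, then diff" step. It assumes the constants and addresses in the -Dverbose=jit output appear as 0x-prefixed hex literals; the function names are made up for illustration and are not part of any Tamarin tooling.

```python
import difflib
import re

# Assumption: constants and addresses show up as 0x-prefixed hex literals.
HEX_LITERAL = re.compile(r"0x[0-9a-fA-F]+")


def normalize(dump: str) -> list:
    """Replace every hex literal with a placeholder and strip trailing spaces."""
    return [HEX_LITERAL.sub("0xN", line).rstrip() for line in dump.splitlines()]


def diff_dumps(a: str, b: str) -> list:
    """Unified-diff lines between two normalized dumps (empty if identical)."""
    return list(
        difflib.unified_diff(normalize(a), normalize(b), "a", "b", lineterm="")
    )
```

An empty result from `diff_dumps` means the two dumps differ only in the literals that were masked out.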
There are quite a few differences in the code generated for the kernel by tamarin-redux and tr-float.  The tamarin-redux code consistently has more restores to merge edges before the calls to the vector getters and also has all the finddefs intact, of course.  There are many other small differences, e.g. in the registers used.  But I sense I won't find the answer this way.
The things I know it is *not*:

- computing and setting the RGB values
- setPixel
- frame rate interaction (tried compiling with various frame rates)
- square root (replaced with AS3 version, they still differ)

Optimization settings in the projects look to be the same, and the compilers are the same too: GCC 4.2.

Once we've removed the square root calls from the program there isn't much else going on that isn't the JIT's doing (one finddef call for Number.POSITIVE_INFINITY which ASC does not constant-fold because I'm compiling without -abcfuture), and as previously observed the JIT's code looks pretty much the same for the two runtimes.
(Of course that instruction dump is from the shell, where no difference is observed, so take it for what it's worth.)
Attached file Release config diff
Release configs are not the same.
Changing the release config appears to remove the performance regression.  This is weird, but there you have it.  Will investigate a little further but I consider this issue likely closed.
With that fix in place:

2008 MacBook Pro / Core-2-Duo / 2.6GHz / 4GB, fps (higher is better):

             FR-FLOAT      FRR

Num            30           30
Num-V          46           45
float-V        52
float4-V       57

Methodology: start up, run 20 seconds, record frame rate.  Verified that all were set to "Open as 32 bit".
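The numbers above were read off by hand. If the fixed-window measurement were automated, it could look roughly like the sketch below. Everything in it is an assumption for illustration: `render_frame` stands in for whatever renders one frame, and the injectable `clock` exists only so the loop can be exercised without waiting 20 seconds.

```python
import time


def measure_fps(render_frame, duration_s=20.0, clock=time.perf_counter):
    """Render frames until duration_s has elapsed; return frames per second.

    render_frame is a hypothetical zero-argument callable that renders one frame.
    """
    start = clock()
    frames = 0
    while True:
        render_frame()
        frames += 1
        elapsed = clock() - start
        if elapsed >= duration_s:
            # Divide by the actual elapsed time, not the nominal window,
            # since the last frame usually overshoots the deadline a little.
            return frames / elapsed
```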
I should mention: all compiled by the new ASC, but Num and Num-V were compiled without -abcfuture and without version=16.

Here are the numbers when those programs are compiled with those switches (the resulting code only runs on FR-FLOAT):


             FR-FLOAT

Num             2 [sic]
Num-V          46


The effectively untyped nature of Num (through its use of Array) and the poor code generation for untyped numeric code in FR-FLOAT combine to make the performance very bad.

Notably here, "old content" (compiled without -abcfuture) does not suffer a performance regression, as per our design.
Attached file RayTracer-Num.as
Attached file RayTracer-Num-V.as
Attached file RayTracer-float-V.as
Attached file RayTracer-float4-V.as (obsolete)
BTW, there are some "ASC bug" annotations in RayTracer-float4-V.as that can probably be removed now.
Updated: Removed the now-redundant float4() casts, retested.  Performance is still fine.
Attachment #580377 - Attachment is obsolete: true
2009 Mac Pro / 2xQuad-core Xeon / 2.93GHz / 16GB, fps (higher is better):

             FR-FLOAT      FRR

Num            42           42
Num-V          58           58
float-V        58
float4-V       58

Methodology: start up, run 20 seconds, record frame rate.  Verified that all were set to "Open as 32 bit".  All compiled by the new ASC, but Num and Num-V were compiled without -abcfuture and without version=16.

It's remarkable that they're all approaching 60fps, slowly slowly.  I disabled framerate limiting but I don't know for a fact that that works properly in standalone builds.  It may be that the thing to do here is to run with larger canvas sizes or many more objects in the scene, on such a fast machine (or indeed on the slower machine too).
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Another methodology might be to change the test to calibrate its own workload to achieve 30fps, with the frame rate still set to 60 or 100fps.  As for screen size vs # of objects, I think # of objects is a more interesting metric.
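A rough sketch of what such self-calibration could look like, using # of objects as the knob. This is an illustration under stated assumptions, not something the test programs do: `frame_time(n)` is a hypothetical callback returning the seconds needed to render one frame with `n` objects, and it is assumed to be non-decreasing in `n`.

```python
def calibrate_object_count(frame_time, target_fps=30.0, max_objects=1 << 20):
    """Largest object count whose frame time still meets target_fps.

    frame_time(n) -> seconds per frame with n objects (hypothetical callback,
    assumed non-decreasing in n). Returns 0 if even one object is too slow.
    """
    budget = 1.0 / target_fps  # seconds available per frame
    lo, hi = 0, max_objects    # lo always fits the budget (0 objects trivially)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the search always makes progress
        if frame_time(mid) <= budget:
            lo = mid       # mid objects still fit; try more
        else:
            hi = mid - 1   # too slow; try fewer
    return lo
```

The reported metric would then be the calibrated object count instead of the frame rate, which sidesteps whatever cap the player puts on fps.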