RayTracer-Num and RayTracer-Num-V significantly slower in fr-float than in flashruntime-redux on core-2-duo

RESOLVED FIXED in Q2 12 - Cyril

Status

Tamarin
Virtual Machine
P2
normal
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: Lars T Hansen, Assigned: Lars T Hansen)

Tracking

(Blocks: 1 bug)

unspecified
Q2 12 - Cyril
x86
Mac OS X

Details

Attachments

(6 attachments, 2 obsolete attachments)

(Assignee)

Description

6 years ago
Identical bytecode, compiled with the new ASC without -abcfuture and with the -swf switch without version=16.  Results are in frames per second, higher is better:

                  FRR       FR-FLOAT

RayTracer-Num     24          19
RayTracer-Num-V   36          25

In theory this kind of regression should not happen; indeed, since FR-FLOAT has vector optimizations, it should be faster than FRR.

Another curious thing is that in FR-FLOAT the float code corresponding to RayTracer-Num-V, RayTracer-float-V, easily runs at 36 FPS.
(Assignee)

Comment 1

6 years ago
Rebuilding the tests with an older ASC against older playerglobal/builtin files does not help.  Definitely looks like a VM problem, not a compiler problem (but I need to look at the dumps to be sure).
(Assignee)

Comment 2

6 years ago
Machine is 2008 MacBook Pro / Core-2-Duo / 2.6GHz / 4GB.
(Assignee)

Updated

6 years ago
Assignee: nobody → lhansen
(Assignee)

Comment 3

6 years ago
(Very preliminary.)

It looks like there could be a bug with relational operators.  In this code:

  dist = closestApproach - Math.sqrt(halfCord2);
  if( dist < closestIntersectionDist )

we're looking at this kind of output:

    subd4 = subd addd15, Math_sqrt3 (in reg? -1)
    ...
    intToAtom1 = calli.none #intToAtom ( immi1/*0x1019030 core*/ ldi77 )
    compare1 = calli.all #compare ( intToAtom1 ldi79 ) (in reg? -1)

but dist and closestApproach and closestIntersectionDist and the output of Math.sqrt are all Number.

More investigation needed, just posting this in case it triggers memories (I could be looking at the wrong code).
(Assignee)

Comment 4

6 years ago
Created attachment 580050 [details]
Test case, not Player dependent
(Assignee)

Comment 5

6 years ago
The issue in comment #3 turned out to be a bug in the test program: an untyped variable is used as a loop limit (in all the programs).  Fixing that, we see a speedup across the board, but the speedup is greater on FRR (36->47 FPS for Num-V) than on FR-FLOAT (25->29 FPS).  So there's clearly a performance drain here somewhere.

(The float cases speed up quite a bit too and for the float4 program we're up to 57 fps, from below 50 before.)

The untyped variable was there in the original program.  This illustrates that the compiler MUST have some sort of "-fast" mode where it warns about untyped variables, Array, and so on.
(Assignee)

Comment 6

6 years ago
Created attachment 580340 [details]
Test case, not Player dependent, v2
Attachment #580050 - Attachment is obsolete: true
(Assignee)

Comment 7

6 years ago
Not really all that relevant, but worth keeping in mind:

The performance difference is not observable on my Mac Pro / 2 x 2.93GHz Quad-Core Xeon / 16 GB.

On this machine, RayTrace-Num reaches 31 fps and RayTrace-Num-V reaches 58fps, on both FRR and FR-FLOAT.  The Num-V results match the float-V results on FR-FLOAT.

Also on this machine the float4-V code is *much* slower than the float-V code (43 fps vs 58 fps), but that will be the subject of another bug.
(Assignee)

Comment 8

6 years ago
In another experiment, the iterated Num-V kernel (the attached test case) has the same performance in tamarin-redux and tr-float, on both machines.

As FR-FLOAT is derived from FRR there should be limited room for differences in the two, apart from the VM and the build settings.
(Assignee)

Comment 9

6 years ago
(In reply to Lars T Hansen from comment #7)

> The performance difference is not observable on my Mac Pro / 2 x 2.93GHz
> Quad-Core Xeon / 16 GB.
> 
> On this machine, RayTrace-Num reaches 31 fps and RayTrace-Num-V reaches
> 58fps, on both FRR and FR-FLOAT.  The Num-V results match the float-V
> results on FR-FLOAT.
> 
> Also on this machine the float4-V code is *much* slower than the float-V
> code (43 fps vs 58 fps), but that will be the subject of another bug.

NOTE: Those were 64-bit runs (because the "Open in 32-bit" setting does not follow the app when it's moved but is stored in the Finder somehow).

With 32-bit, the float4-V code is on par with the float-V code on Xeon (whew).

The RayTracer-Num and RayTracer-Num-V results are the same for the two players on this platform for 32-bit runs as well, but the RayTracer-Num results are better than in the 64-bit runs (41 fps).  Of course it remains a mystery why it would be 10fps slower on a 64-bit build...

The others all peak at 58fps.  I don't know what the maximum frame rate supported by the player is; I compiled the SWF for 100fps.
(Assignee)

Updated

6 years ago
Summary: RayTracer-Num and RayTracer-Num-V significantly slower in fr-float than in flashruntime-redux → RayTracer-Num and RayTracer-Num-V significantly slower in fr-float than in flashruntime-redux on core-2-duo
(Assignee)

Comment 10

6 years ago
Apart from constants and addresses the machine code generated for the kernel program is identical on the MacBook Pro and the Mac Pro (Debug-Debugger build with -Dnodebugger -Dverbose=jit, textual comparison of the output after replacing all hex constants).
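
For anyone repeating the comparison, here is a minimal sketch of the "replace all hex constants, then compare textually" step. It assumes the constants and addresses in the verbose JIT output are written as 0x-prefixed literals; the function names, the file labels and the two sample dumps are made up for illustration and are not taken from the actual output.

```python
import difflib
import re

# Assumption: constants and addresses appear as 0x-prefixed hex literals.
HEX_RE = re.compile(r"0x[0-9a-fA-F]+")


def normalize(lines):
    """Replace every hex literal with a fixed placeholder and drop trailing whitespace."""
    return [HEX_RE.sub("0xN", line.rstrip()) for line in lines]


def diff_dumps(dump_a, dump_b):
    """Return the unified-diff lines between two dumps after normalization.

    An empty list means the dumps differ only in their hex constants.
    """
    a = normalize(dump_a.splitlines())
    b = normalize(dump_b.splitlines())
    return list(difflib.unified_diff(a, b, "macbook-pro", "mac-pro", lineterm=""))


# Hypothetical dump fragments that differ only in addresses.
sample_a = "mov eax, 0x1019030\ncall 0x00ab12f0\n"
sample_b = "mov eax, 0x2044aa0\ncall 0x00cd9910\n"

if __name__ == "__main__":
    print(diff_dumps(sample_a, sample_b) or "identical apart from hex constants")
```

Hex values printed without the 0x prefix would not be caught by this pattern and would need a stricter rule.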
(Assignee)

Comment 11

6 years ago
There are quite a few differences in the code generated for the kernel by tamarin-redux and tr-float.  The tr code consistently has more restores to merge edges before the calls to the vector getters and also has all the finddefs intact, of course.  There are many other small differences, eg in the registers used.  But I sense I won't find the answer this way.
(Assignee)

Comment 12

6 years ago
The things I know it *not* to be:

- computing and setting the RGB values
- setPixel
- frame rate interaction (tried compiling with various frame rates)
- square root (replaced with AS3 version, they still differ)

Optimization settings in the projects look to be the same, and the compilers are the same too (GCC 4.2).

Once we've removed the square root calls from the program there isn't much else going on that isn't the JIT's doing (one finddef call for Number.POSITIVE_INFINITY which ASC does not constant-fold because I'm compiling without -abcfuture), and as previously observed the JIT's code looks pretty much the same for the two runtimes.
(Assignee)

Comment 13

6 years ago
(Of course that instruction dump is from the shell, where no difference is observed, so take it for what it's worth.)
(Assignee)

Comment 14

6 years ago
Created attachment 580356 [details]
Release config diff

Release configs are not the same.
(Assignee)

Comment 15

6 years ago
Changing the release config appears to remove the performance regression.  This is weird, but there you have it.  Will investigate a little further but I consider this issue likely closed.
(Assignee)

Comment 16

6 years ago
With that fix in place:

2008 MacBook Pro / Core-2-Duo / 2.6GHz / 4GB, fps (higher is better):

             FR-FLOAT      FRR

Num            30           30
Num-V          46           45
float-V        52
float4-V       57

Methodology: start up, run 20 seconds, record frame rate.  Verified that all were set to "Open as 32 bit".
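
As a rough illustration of the "run for a fixed time, record frame rate" methodology, here is a minimal sketch. It is not how the player reports its frame rate; `measure_fps`, `render_frame` and the injectable `clock` parameter are hypothetical names used only for this example.

```python
import time


def measure_fps(render_frame, duration_s=20.0, clock=time.perf_counter):
    """Call render_frame repeatedly for about duration_s seconds and return frames per second."""
    start = clock()
    frames = 0
    while True:
        render_frame()
        frames += 1
        elapsed = clock() - start
        if elapsed >= duration_s:
            # Divide by the real elapsed time, not the nominal window,
            # so the last (overshooting) frame does not skew the result.
            return frames / elapsed


if __name__ == "__main__":
    # Stand-in workload: sleep a few milliseconds per "frame".
    print(measure_fps(lambda: time.sleep(0.005), duration_s=1.0))
```

The clock is a parameter so that the function can be exercised with a simulated clock instead of waiting for the full window.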
(Assignee)

Comment 17

6 years ago
I should mention: all compiled by the new ASC, but Num and Num-V were compiled without -abcfuture and without version=16.

Here are the numbers when those programs are compiled with those switches (the resulting code only runs on FR-FLOAT):


             FR-FLOAT

Num             2 [sic]
Num-V          46


The effectively untyped nature of Num (through its use of Array) and the poor code generation for untyped numeric code in FR-FLOAT combine to make the performance very bad.

Notably here, "old content" (compiled without -abcfuture) does not suffer a performance regression, as per our design.
(Assignee)

Comment 18

6 years ago
Created attachment 580374 [details]
RayTracer-Num.as
(Assignee)

Comment 19

6 years ago
Created attachment 580375 [details]
RayTracer-Num-V.as
(Assignee)

Comment 20

6 years ago
Created attachment 580376 [details]
RayTracer-float-V.as
(Assignee)

Comment 21

6 years ago
Created attachment 580377 [details]
RayTracer-float4-V.as
(Assignee)

Comment 22

6 years ago
BTW, there are some "ASC bug" annotations in RayTracer-float4-V.as that can probably be removed now.
(Assignee)

Comment 23

6 years ago
Created attachment 580389 [details]
RayTracer-float4-V.as, updated

Updated: Removed the now-redundant float4() casts, retested.  Performance is still fine.
Attachment #580377 - Attachment is obsolete: true
(Assignee)

Comment 24

6 years ago
2009 Mac Pro / 2xQuad-core Xeon / 2.93GHz / 16GB, fps (higher is better):

             FR-FLOAT      FRR

Num            42           42
Num-V          58           58
float-V        58
float4-V       58

Methodology: start up, run 20 seconds, record frame rate.  Verified that all were set to "Open as 32 bit".  All compiled by the new ASC, but Num and Num-V were compiled without -abcfuture and without version=16.

It's remarkable that they're all approaching 60fps, slowly slowly.  I disabled framerate limiting but I don't know for a fact that that works properly in standalone builds.  It may be that the thing to do here is to run with larger canvas sizes or many more objects in the scene, on such a fast machine (or indeed on the slower machine too).
(Assignee)

Updated

6 years ago
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED

Comment 25

6 years ago
Another methodology might be to change the test to calibrate its own workload to achieve 30fps, with the frame rate still set to 60 or 100fps.  As for screen size vs # of objects, I think # of objects is a more interesting metric.
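
A minimal sketch of what such self-calibration could look like, assuming frame cost grows roughly linearly with the number of objects. `calibrate`, `measure_frame_seconds` and the cost model in `fake_frame_seconds` are hypothetical and exist only for illustration; a real test would time actual frames instead.

```python
def calibrate(measure_frame_seconds, target_fps=30.0, start_objects=8,
              tolerance=0.05, max_rounds=50):
    """Adjust the object count until the frame time is within tolerance of the target.

    measure_frame_seconds(n) must return the seconds needed to render one
    frame with n objects in the scene.
    """
    target = 1.0 / target_fps
    n = start_objects
    for _ in range(max_rounds):
        t = measure_frame_seconds(n)
        if abs(t - target) <= tolerance * target:
            break
        # Assumption: cost is roughly proportional to the object count,
        # so rescale n by the ratio of target time to measured time.
        n = max(1, round(n * target / t))
    return n


def fake_frame_seconds(n):
    """Made-up cost model: 2 ms fixed overhead plus 0.5 ms per object."""
    return 0.002 + 0.0005 * n


if __name__ == "__main__":
    n = calibrate(fake_frame_seconds)
    print(n, 1.0 / fake_frame_seconds(n))
```

The calibrated object count then becomes the reported score, which keeps the result meaningful on machines that would otherwise hit the player's frame rate ceiling.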