Closed Bug 708614 Opened 13 years ago Closed 13 years ago

RayTracer-Num and RayTracer-Num-V significantly slower in fr-float than in flashruntime-redux on core-2-duo

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

Q2 12 - Cyril

People

(Reporter: lhansen, Assigned: lhansen)

Attachments

(6 files, 2 obsolete files)

Test case, not Player dependent 13 years ago Lars T Hansen 13.00 KB, text/plain
Test case, not Player dependent, v2 13 years ago Lars T Hansen 13.00 KB, text/plain
Release config diff 13 years ago Lars T Hansen 1014 bytes, text/plain
RayTracer-Num.as 13 years ago Lars T Hansen 19.37 KB, text/plain
RayTracer-Num-V.as 13 years ago Lars T Hansen 19.67 KB, text/plain
RayTracer-float-V.as 13 years ago Lars T Hansen 19.51 KB, text/plain
RayTracer-float4-V.as 13 years ago Lars T Hansen 16.34 KB, text/plain
RayTracer-float4-V.as, updated 13 years ago Lars T Hansen 15.98 KB, text/plain

Lars T Hansen

Assignee

Description

13 years ago

Identical bytecode, compiled with the new ASC without -abcfuture and with the -swf switch without version=16.  Results are in frames per second (higher is better):

                  FRR       FR-FLOAT

RayTracer-Num     24          19
RayTracer-Num-V   36          25

In theory this kind of regression should not happen; indeed, since FR-FLOAT has vector optimizations, it should be faster than FRR.

Another curious thing is that in FR-FLOAT, RayTracer-float-V (the float code corresponding to RayTracer-Num-V) runs at 36 FPS easily.

Lars T Hansen

Assignee

Comment 1

13 years ago

Rebuilding the tests with an older ASC against older playerglobal/builtin files does not help.  Definitely looks like a VM problem, not a compiler problem (but I need to look at the dumps to be sure).

Lars T Hansen

Assignee

Comment 2

13 years ago

Machine is 2008 MacBook Pro / Core-2-Duo / 2.6GHz / 4GB.

Lars T Hansen

Assignee

Updated

13 years ago

Assignee: nobody → lhansen

Lars T Hansen

Assignee

Comment 3

13 years ago

(Very preliminary.)

It could look like there's a bug with relational operators. For this code:

  dist = closestApproach - Math.sqrt(halfCord2);
  if( dist < closestIntersectionDist )

we're looking at this kind of output:

    subd4 = subd addd15, Math_sqrt3 (in reg? -1)
    ...
    intToAtom1 = calli.none #intToAtom ( immi1/*0x1019030 core*/ ldi77 )
    compare1 = calli.all #compare ( intToAtom1 ldi79 ) (in reg? -1)

but dist, closestApproach, closestIntersectionDist, and the output of Math.sqrt are all Number.

More investigation needed, just posting this in case it triggers memories (I could be looking at the wrong code).

Lars T Hansen

Assignee

Comment 4

13 years ago

Attached file Test case, not Player dependent (obsolete) — Details

Lars T Hansen

Assignee

Comment 5

13 years ago

The issue in comment #3 turned out to be a bug in the test program: an untyped variable is used as a loop limit (in all the programs).  With that fixed, there's a speedup across the board, but the speedup is greater on FRR (36->47 FPS for Num-V) than on FR-FLOAT (25->29 FPS).  So there's clearly a performance drain here somewhere.
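To put the Num-V numbers above in relative terms, here is a small worked calculation. It is only a sketch: the helper name `relative_speedup` is made up for illustration, and the inputs are the frame rates quoted in this comment.

```python
def relative_speedup(before_fps: float, after_fps: float) -> float:
    """Return the fractional speedup going from before_fps to after_fps."""
    return after_fps / before_fps - 1.0

# Num-V frame rates before and after typing the loop limit (from this comment).
frr = relative_speedup(36, 47)
fr_float = relative_speedup(25, 29)

print(f"FRR:      {frr:.1%}")
print(f"FR-FLOAT: {fr_float:.1%}")
```

That is roughly a 31% gain on FRR against 16% on FR-FLOAT, so the same source fix pays off about half as much on FR-FLOAT.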

(The float cases speed up quite a bit too and for the float4 program we're up to 57 fps, from below 50 before.)

The untyped variable was there in the original program.  This illustrates that the compiler MUST have some sort of "-fast" mode where it warns about untyped variables, Array, and so on.

Lars T Hansen

Assignee

Comment 6

13 years ago

Attached file Test case, not Player dependent, v2 — Details

Attachment #580050 - Attachment is obsolete: true

Lars T Hansen

Assignee

Comment 7

13 years ago

Not really all that relevant, but worth keeping in mind:

The performance difference is not observable on my Mac Pro / 2 x 2.93GHz Quad-Core Xeon / 16 GB.

On this machine, RayTracer-Num reaches 31 fps and RayTracer-Num-V reaches 58 fps, on both FRR and FR-FLOAT.  The Num-V results match the float-V results on FR-FLOAT.

Also on this machine the float4-V code is *much* slower than the float-V code (43 fps vs 58 fps), but that will be the subject of another bug.

Lars T Hansen

Assignee

Comment 8

13 years ago

In another experiment, the iterated Num-V kernel (the attached test case) has the same performance in tamarin-redux and tr-float, on both machines.

As FR-FLOAT is derived from FRR there should be limited room for differences between the two, apart from the VM and the build settings.

Lars T Hansen

Assignee

Comment 9

13 years ago

(In reply to Lars T Hansen from comment #7)

> The performance difference is not observable on my Mac Pro / 2 x 2.93GHz
> Quad-Core Xeon / 16 GB.
> 
> On this machine, RayTrace-Num reaches 31 fps and RayTrace-Num-V reaches
> 58fps, on both FRR and FR-FLOAT.  The Num-V results match the float-V
> results on FR-FLOAT.
> 
> Also on this machine the float4-V code is *much* slower than the float-V
> code (43 fps vs 58 fps), but that will be the subject of another bug.

NOTE: Those were 64-bit runs (because the "Open in 32-bit" setting does not follow the app when it's moved but is stored in the Finder somehow).

With 32-bit, the float4-V code is on par with the float-V code on Xeon (whew).

On this platform the RayTracer-Num and RayTracer-Num-V results are also the same for the two players in 32-bit runs, but the RayTracer-Num results are better than in the 64-bit runs (41 fps vs 31 fps).  Of course it remains a mystery why it would be 10 fps slower on a 64-bit build...

The others all peak at 58 fps.  I don't know what maximum frame rate the player supports; I compiled the SWF for 100 fps.

Lars T Hansen

Assignee

Updated

13 years ago

Summary: RayTracer-Num and RayTracer-Num-V significantly slower in fr-float than in flashruntime-redux → RayTracer-Num and RayTracer-Num-V significantly slower in fr-float than in flashruntime-redux on core-2-duo

Lars T Hansen

Assignee

Comment 10

13 years ago

Apart from constants and addresses, the machine code generated for the kernel program is identical on the MacBook Pro and the Mac Pro (Debug-Debugger build with -Dnodebugger -Dverbose=jit, textual comparison of the output after replacing all hex constants).
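The kind of textual comparison described above could be scripted along these lines. This is a sketch, not the procedure actually used: the hex-literal pattern, the `0xHEX` placeholder, and the two short excerpts (including the second address) are assumptions standing in for real -Dverbose=jit output.

```python
import difflib
import re

# Matches hex literals such as 0x1019030; the exact dump syntax is assumed.
HEX_RE = re.compile(r"0x[0-9a-fA-F]+")

def normalize(dump: str) -> list[str]:
    """Replace every hex constant with a placeholder and split into lines."""
    return HEX_RE.sub("0xHEX", dump).splitlines()

def diff_dumps(a: str, b: str) -> list[str]:
    """Return unified-diff lines between two normalized dumps (empty if identical)."""
    return list(difflib.unified_diff(
        normalize(a), normalize(b),
        fromfile="macbook", tofile="macpro", lineterm=""))

# Hypothetical excerpts that differ only in an embedded address.
macbook = ("intToAtom1 = calli.none #intToAtom ( immi1/*0x1019030 core*/ ldi77 )\n"
           "subd4 = subd addd15, Math_sqrt3")
macpro = ("intToAtom1 = calli.none #intToAtom ( immi1/*0x2a4f010 core*/ ldi77 )\n"
          "subd4 = subd addd15, Math_sqrt3")

print(diff_dumps(macbook, macpro))  # an empty list means no differences remain
```

Replacing constants before diffing keeps address noise from hiding real differences such as register choice or extra instructions.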

Lars T Hansen

Assignee

Comment 11

13 years ago

There are quite a few differences in the code generated for the kernel by tamarin-redux and tr-float.  The tr code consistently has more restores to merge edges before the calls to the vector getters and also has all the finddefs intact, of course.  There are many other small differences, e.g. in the registers used.  But I sense I won't find the answer this way.

Lars T Hansen

Assignee

Comment 12

13 years ago

The things I know it *not* to be:

- computing and setting the RGB values
- setPixel
- frame rate interaction (tried compiling with various frame rates)
- square root (replaced with AS3 version, they still differ)

Optimization settings in the projects look to be the same, and the compilers are the same too (GCC 4.2).

Once we've removed the square root calls from the program there isn't much else going on that isn't the JIT's doing (one finddef call for Number.POSITIVE_INFINITY which ASC does not constant-fold because I'm compiling without -abcfuture), and as previously observed the JIT's code looks pretty much the same for the two runtimes.

Lars T Hansen

Assignee

Comment 13

13 years ago

(Of course that instruction dump is from the shell, where no difference is observed, so take it for what it's worth.)

Lars T Hansen

Assignee

Comment 14

13 years ago

Attached file Release config diff — Details

Release configs are not the same.

Lars T Hansen

Assignee

Comment 15

13 years ago

Changing the release config appears to remove the performance regression.  This is weird, but there you have it.  Will investigate a little further but I consider this issue likely closed.

Lars T Hansen

Assignee

Comment 16

13 years ago

With that fix in place:

2008 MacBook Pro / Core-2-Duo / 2.6GHz / 4GB, fps (higher is better):

             FR-FLOAT      FRR

Num            30           30
Num-V          46           45
float-V        52
float4-V       57

Methodology: start up, run 20 seconds, record frame rate.  Verified that all were set to "Open as 32 bit".
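One possible reading of this methodology, as a sketch: average the frame count over a fixed window. The function `measure_fps` and its injectable clock are illustrative assumptions; the actual numbers here came from the player, not from code like this.

```python
import time
from typing import Callable

def measure_fps(render_frame: Callable[[], None],
                duration_s: float = 20.0,
                clock: Callable[[], float] = time.perf_counter) -> float:
    """Render frames for duration_s seconds and return the average frame rate."""
    start = clock()
    frames = 0
    while clock() - start < duration_s:
        render_frame()
        frames += 1
    return frames / (clock() - start)

# Deterministic demo: a fake clock that advances 0.5 s per rendered frame.
now = [0.0]

def fake_render() -> None:
    now[0] += 0.5

print(measure_fps(fake_render, duration_s=20.0, clock=lambda: now[0]))  # 2.0
```

Passing the clock in makes the measurement testable without waiting 20 seconds; a real run would use the default timer and a real render callback.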

Lars T Hansen

Assignee

Comment 17

13 years ago

I should mention: all compiled by the new ASC, but Num and Num-V were compiled without -abcfuture and without version=16.

Here are the numbers when those programs are compiled with those switches (the resulting code only runs on FR-FLOAT):


             FR-FLOAT

Num             2 [sic]
Num-V          46


The effectively untyped nature of Num (through its use of Array) and the poor code generation for untyped numeric code in FR-FLOAT combine to make the performance very bad.

Notably here, "old content" (compiled without -abcfuture) does not suffer a performance regression, as per our design.

Lars T Hansen

Assignee

Comment 18

13 years ago

Attached file RayTracer-Num.as — Details

Lars T Hansen

Assignee

Comment 19

13 years ago

Attached file RayTracer-Num-V.as — Details

Lars T Hansen

Assignee

Comment 20

13 years ago

Attached file RayTracer-float-V.as — Details

Lars T Hansen

Assignee

Comment 21

13 years ago

Attached file RayTracer-float4-V.as (obsolete) — Details

Lars T Hansen

Assignee

Comment 22

13 years ago

BTW, there are some "ASC bug" annotations in RayTracer-float4-V.as that can probably be removed now.

Lars T Hansen

Assignee

Comment 23

13 years ago

Attached file RayTracer-float4-V.as, updated — Details

Updated: Removed the now-redundant float4() casts, retested.  Performance is still fine.

Attachment #580377 - Attachment is obsolete: true

Lars T Hansen

Assignee

Comment 24

13 years ago

2009 Mac Pro / 2xQuad-core Xeon / 2.93GHz / 16GB, fps (higher is better):

             FR-FLOAT      FRR

Num            42           42
Num-V          58           58
float-V        58
float4-V       58

Methodology: start up, run 20 seconds, record frame rate.  Verified that all were set to "Open as 32 bit".  All compiled by the new ASC, but Num and Num-V were compiled without -abcfuture and without version=16.

It's remarkable that they're all approaching 60fps, slowly slowly.  I disabled framerate limiting but I don't know for a fact that that works properly in standalone builds.  It may be that the thing to do here is to run with larger canvas sizes or many more objects in the scene, on such a fast machine (or indeed on the slower machine too).

Lars T Hansen

Assignee

Updated

13 years ago

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Edwin Smith

Comment 25

13 years ago

Another methodology might be to change the test to calibrate its own workload to achieve 30fps, with the frame rate still set to 60 or 100fps.  As for screen size vs # of objects, I think # of objects is a more interesting metric.
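A self-calibrating workload of this kind could be sketched as a search over the number of scene objects. Everything named here is hypothetical: `calibrate_objects`, the `frame_time_s` callback, and the linear cost model in the demo are assumptions for illustration, and the search assumes frame time grows monotonically with object count.

```python
from typing import Callable

def calibrate_objects(frame_time_s: Callable[[int], float],
                      target_fps: float = 30.0,
                      max_objects: int = 1 << 20) -> int:
    """Return the largest object count whose frame time still meets target_fps.

    Returns 0 if even a single object misses the frame budget.
    """
    budget = 1.0 / target_fps
    lo, hi = 0, max_objects
    while lo < hi:
        mid = (lo + hi + 1) // 2      # round up so the range always shrinks
        if frame_time_s(mid) <= budget:
            lo = mid                  # mid objects still fit in the budget
        else:
            hi = mid - 1              # too slow, try fewer objects
    return lo

# Hypothetical cost model: 4 ms fixed overhead plus 0.1 ms per object.
def simulated_frame_time(n: int) -> float:
    return 0.004 + 0.0001 * n

print(calibrate_objects(simulated_frame_time))  # 293 under this model
```

The reported metric then becomes the object count sustained at 30fps rather than the frame rate itself, which avoids results bunching up against whatever frame rate cap the player applies. Real frame timings are noisy, so a practical version would probably average several frames per probe.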
