Open Bug 671957 Opened 13 years ago Updated 2 years ago

Compare simple C kernels on ARM and i7

Categories: Core :: JavaScript Engine (defect)

People: (Reporter: dmandelin, Unassigned)

References: (Blocks 1 open bug)

Attachments: (1 file)

Stuart told me it would be awesome if websites could work on mobile just as well as on desktop. For JS, this basically means performance. I'm not sure yet how to think about time and space tradeoffs, but for now maybe we can think of it as how much speed we can get with an acceptable space cost.

There are 3 main questions here:

1. What is the current state of JS perf on mobile ARM devices compared to desktop?
2. How close can we get JS perf on ARM to desktop?
3. What do we have to do to get that close?
So I decided to see how various aspects of the hardware compare between x86 and ARM. With a rather simplistic test, I got these numbers (in seconds per run):

800 MHz ARM:
double:               73.965850
single:               64.780529
integer:              20.124182
random memory access:  5.966641

2300 MHz Core i7 (about 3 times the clock speed):
double:                5.851761
single:                5.861845
integer:               4.758156
random memory access:  1.510734

These tests were written in C, compiled with -O3, and with VFP turned on for ARM. It seems like the ARM floating-point code is running at less than a tenth of the speed of the x86 floating-point code, while ARM's integer and memory accesses both seem to be keeping up with x86. Unfortunately, this indicates that floating-point-heavy code will have a lot of trouble keeping up. When I get some newer hardware, I'll test with that.
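For reference, a minimal sketch of what the "double" microkernel might look like (this is hypothetical; the actual code is in the attachment):

```c
/* Hypothetical sketch of a double-precision microkernel: a tight
   multiply-add loop. The actual attached code may differ. Returning
   the result keeps -O3 from optimizing the loop away. */
double fp_kernel(long iters) {
    double x = 0.5;
    for (long i = 0; i < iters; i++)
        x = x * 0.999999 + 0.000001;  /* loop-carried FP dependency */
    return x;                          /* drifts slowly toward 1.0 */
}
```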
(In reply to comment #1)

I don't know what the C is, exactly, but I would guess that this test implicitly assumes that the compilation from x86 GCC (presumably with SSE enabled? or is it just x87?) and ARM-with-VFP GCC is at parity, which I seriously doubt.

In order to get a solid comparison here I think comparing some FPU kernels that can't be easily reordered (since all the ARM uprocs we'll have are in-order) is a better test. Especially because in the JIT we hand craft the code! :-)
(In reply to comment #1)
> It seems like the arm floating point code is running at less than a tenth of
> the x86 floating point code.  Arm's integer and the memory accesses both
> seem to be keeping up with x86.  Unfortunately, this seems to indicate that
> code that is floating point heavy will have lots of trouble keeping up. 
> When I get some newer hardware, I'll test with that.

Nice! This is exactly the kind of basic info we need.

I think the i7 is only 3x the clock speed in this case, right? For int and memory, you measured a 4x difference. It seems like that could be due to things like 'width' (is that the right term? I mean things like fetch bandwidth and number of functional units), reordering, and/or memory bus speed.

- I'd be curious about a variant on Chris's idea: make a little integer kernel in C, test that, and then see if hand-hacking the ARM assembly can improve it. Maybe get help from Jacob on that.

- Another key question is: Are the benchmark scores in line with the results above? I.e., do we tend to run 10x slower on the fp ones and 4x slower on the others? That would be great to know.

(Note: I intended for this to be a meta bug, and each experiment on this general topic would be a separate bug linking to it. See bug 642003 for a prior example of this bug organization. Don't worry about it here, I'll make a new meta bug and morph this one.)
Assignee: dmandelin → mrosenberg
Summary: [meta] Understand JS performance on ARM → Compare simple C kernels on ARM and i7
Blocks: 673000
(In reply to comment #3)
> - I'd be curious about a variant on Chris's idea: make a little integer
> kernel in C, test that, and then see if hand-hacking the ARM assembly can
> improve it. Maybe get help from Jacob on that.

Marty and I also chatted about this a little bit IRL -- if you make a micro-kernel with a single long (loop carried) dependency chain the issue width of your hardware won't matter and your reordering window won't help you at all. You can scale the number of simultaneous dependency chains up to see where the issue width chokes on the "less super" scalar pipelines.
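A hedged sketch of what that experiment could look like (names and constants are made up for illustration, not taken from the bug):

```c
/* One long loop-carried chain: issue width and the reorder window are
   useless here, so this measures raw FP latency. */
double one_chain(long iters) {
    double x = 2.0;
    for (long i = 0; i < iters; i++)
        x = x * 0.99 + 0.01;          /* each step depends on the previous */
    return x;                          /* converges toward 1.0 */
}

/* Four independent chains in the same loop: a superscalar, out-of-order
   core can keep several in flight, so throughput scales with issue
   width until the "less super" scalar pipeline chokes. */
double four_chains(long iters) {
    double a = 2.0, b = 3.0, c = 4.0, d = 5.0;
    for (long i = 0; i < iters; i++) {
        a = a * 0.99 + 0.01;
        b = b * 0.99 + 0.01;
        c = c * 0.99 + 0.01;
        d = d * 0.99 + 0.01;
    }
    return a + b + c + d;              /* converges toward 4.0 */
}
```

Timing one_chain against four_chains (at the same iteration count) would show roughly equal times on a wide out-of-order core and a ~4x difference on a core that cannot overlap the chains.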
(In reply to comment #3)
> - I'd be curious about a variant on Chris's idea: make a little integer
> kernel in C, test that, and then see if hand-hacking the ARM assembly can
> improve it. Maybe get help from Jacob on that.

GCC does a pretty good job on common cases, i.e. if you aren't doing
anything crazy or out on the boundaries of normal C. Still, it's always
worth having a look to check.

Consider A8 the lowest common denominator in terms of ARM floating-point
performance. A8's VFP is not pipelined. From its TRM:

"The VFP coprocessor is a nonpipelined floating-point execution engine
that can execute any VFPv3 data-processing instruction. Each instruction
runs to completion before the next instruction can issue, and there is
no forwarding of VFP results to other instructions."

You can find this at infocenter.arm.com ("Cortex-A series processors" ->
"Cortex-A8" -> "r3p2" -> "Instruction Cycle Timing" -> "VFP instructions").

A9's VFP is entirely different, and it's very fast. I suspect that the
likes of Qualcomm's Snapdragon (in several HTC phones) also have fast
VFP, though I've not measured them.

Also, bear in mind that even on FP-heavy JavaScript, the overall
percentage of VFP instructions involved is tiny compared to the loads
and other overheads that we have. VFP performance still has an impact,
but not as much as it would for C code.
(In reply to comment #5)
> (In reply to comment #3)
> > - I'd be curious about a variant on Chris's idea: make a little integer
> > kernel in C, test that, and then see if hand-hacking the ARM assembly can
> > improve it. Maybe get help from Jacob on that.
> 
> GCC does a pretty good job on common cases, i.e. if you aren't doing
> anything crazy and on the boundaries of normal C. Still, it's always
> worth having a look to check.
> 
To be fair, the version of GCC that I am using is kind of old (gcc-4.4), which,
IIRC, does not have their amazing loop-optimization library.

> Consider A8 the lowest common denominator in terms of ARM floating-point
> performance. A8's VFP is not pipelined. From its TRM:
This is the board that I've been using (Freescale MX51 Lange5.1 Board).

> Also, bear in mind that even on FP-heavy JavaScript, the overall
> percentage of VFP instructions involved is tiny compared to the loads
> and other overheads that we have. VFP performance still has an impact,
> but not as much as it would for C code.

This would be my guess; I'm going to take a look at what we generate for some numerically heavy code.

> I don't know what the C is, exactly, but I would guess that this test
> implicitly assumes that the compilation from x86 GCC (presumably with SSE 
> enabled? or is it just x87?) and ARM-with-VFP GCC is at parity, which I 
> seriously doubt.

I'll attach the C code in a bit.  I'm now compiling with -march=core2, which enables SSE2, SSE3, and SSSE3.
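For reference, the two kinds of invocation being compared would look roughly like this (the file names and the exact ARM tuning flags are illustrative, not taken from the bug):

```sh
# x86: -march=core2 enables SSE2/SSE3/SSSE3, as noted above
gcc -O3 -march=core2 kernels.c -o kernels-x86

# ARM: turn on VFP; the softfp ABI was typical for toolchains of this era
gcc -O3 -mfpu=vfp -mfloat-abi=softfp kernels.c -o kernels-arm
```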

> In order to get a solid comparison here I think comparing some FPU kernels 
> that can't be easily reordered (since all the ARM uprocs we'll have are 
> in-order) is a better test. Especially because in the JIT we hand craft the 
> code! :-)

I've added in two new tests, one that is basically a linear floating point test, and one that is basically a linear integer test.  The new results are:
x86:
double/mand:                     2.2143
double/logistic:                 2.0211
float/mand:                      2.2091
float/logistic:                  2.9499
int/fermat:                      1.3415
int/rand:                        2.0532
random_memory/merge_list_sort:   0.7088
arm:
double/mand:                    73.2609
double/logistic:                27.2753
float/mand:                     64.2513
float/logistic:                 29.5893
int/fermat:                     19.9106
int/rand:                       14.2950
random_memory/merge_list_sort:   5.9424

In this case, mand and logistic take about the same amount of time on x86, but mand is more than twice as slow as logistic on ARM. So, as we expected, x86 is much faster on code with fewer serial dependencies.
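To make the dependency structure concrete, here is a hypothetical sketch of the two shapes of kernel (not the attached code): the logistic map is one unbreakable serial chain, while Mandelbrot-style iteration over many points offers independent work that an out-of-order core can overlap.

```c
/* Logistic map: x_{n+1} = r * x_n * (1 - x_n). A single loop-carried
   chain; no reordering can hide the FP latency. */
double logistic(double x, long n) {
    for (long i = 0; i < n; i++)
        x = 3.5 * x * (1.0 - x);
    return x;
}

/* Mandelbrot escape-time count for one point c = cr + ci*i. Different
   points are independent of each other, so iterating a whole row of
   them gives the hardware multiple chains to keep in flight. */
int mandel_iters(double cr, double ci, int max) {
    double zr = 0.0, zi = 0.0;
    int i;
    for (i = 0; i < max && zr * zr + zi * zi <= 4.0; i++) {
        double t = zr * zr - zi * zi + cr;  /* z = z^2 + c */
        zi = 2.0 * zr * zi + ci;
        zr = t;
    }
    return i;
}
```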
Extract with:

  tar xzf foo.tgz
  cd hw; ./doit

If other people with working ARM devices running Linux could test it out and
post numbers for comparison, that would be useful.
># uname -a
>Darwin Toms-iPod-Touch 11.0.0 Darwin Kernel Version 11.0.0: Thu Feb 10 21:45:19 PST 2011; root:xnu-1735.46~2/RELEASE_ARM_S5L8922X iPod3,1 arm N18AP Darwin
># gcc --version
>gcc (GCC) 4.2.1 (Based on Apple Inc. build 5555)
>Copyright (C) 2007 Free Software Foundation, Inc.
>This is free software; see the source for copying conditions.  There is NO
>warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

random_memory/merge_list_sort had build failures, so I removed it.

double/mand:     104.6216
double/logistic:  36.9583
float/mand:       69.9893
float/logistic:   41.0557
int/fermat:       61.7327
int/rand:         20.3462

The bug assignee didn't log in to Bugzilla in the last 7 months, so the assignee is being reset.

Assignee: marty.rosenberg → nobody
Severity: normal → S3