Open Bug 671957 Opened 13 years ago Updated 2 years ago

Compare simple C kernels on ARM and i7

Categories: Core :: JavaScript Engine (defect)

People: (Reporter: dmandelin, Unassigned)

References: (Blocks 1 open bug)

Attachments: (1 file)

Stuart told me it would be awesome if websites could work on mobile just as well as on desktop. For JS, this basically means performance. I'm not sure yet how to think about time and space tradeoffs, but for now maybe we can think of it as how much speed we can get with an acceptable space cost.

There are 3 main questions here:

1. What is the current state of JS perf on mobile ARM devices compared to desktop?
2. How close can we get JS perf on ARM to desktop?
3. What do we have to do to get that close?
So I decided to see how various aspects of the hardware compare between x86 and ARM. With a rather simplistic test, I got these numbers (in seconds per run):

800 MHz ARM:
double:               73.965850
single:               64.780529
integer:              20.124182
random memory access:  5.966641

2300 MHz Core i7 (about 3 times the clock speed):
double:                5.851761
single:                5.861845
integer:               4.758156
random memory access:  1.510734

These tests were written in C, compiled with -O3, and with VFP turned on for ARM. It seems like the ARM floating-point code is running at less than a tenth of the speed of the x86 floating-point code, while ARM's integer and memory accesses both seem to be keeping up with x86. Unfortunately, this indicates that floating-point-heavy code will have a lot of trouble keeping up. When I get some newer hardware, I'll test with that.
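For reference, a minimal sketch of what the "double" microkernel might look like (this is hypothetical; the actual code is in the attachment):

```c
/* Hypothetical sketch of a double-precision microkernel: a tight
   multiply-add loop. The actual attached code may differ. Returning
   the result keeps -O3 from optimizing the loop away. */
double fp_kernel(long iters) {
    double x = 0.5;
    for (long i = 0; i < iters; i++)
        x = x * 0.999999 + 0.000001;  /* loop-carried FP dependency */
    return x;                          /* drifts slowly toward 1.0 */
}
```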
(In reply to comment #1)

I don't know what the C is, exactly, but I would guess that this test implicitly assumes that the compilation from x86 GCC (presumably with SSE enabled? or is it just x87?) and ARM-with-VFP GCC is at parity, which I seriously doubt.

In order to get a solid comparison here I think comparing some FPU kernels that can't be easily reordered (since all the ARM uprocs we'll have are in-order) is a better test. Especially because in the JIT we hand craft the code! :-)
(In reply to comment #1)
> It seems like the arm floating point code is running at less than a tenth of
> the x86 floating point code.  Arm's integer and the memory accesses both
> seem to be keeping up with x86.  Unfortunately, this seems to indicate that
> code that is floating point heavy will have lots of trouble keeping up. 
> When I get some newer hardware, I'll test with that.

Nice! This is exactly the kind of basic info we need.

I think the i7 is only 3x the clock speed in this case, right? For int and memory, you measured a 4x difference. It seems like that could be due to things like 'width' (is that the right term? I mean things like fetch bandwidth and number of functional units), reordering, and/or memory bus speed.

- I'd be curious about a variant on Chris's idea: make a little integer kernel in C, test that, and then see if hand-hacking the ARM assembly can improve it. Maybe get help from Jacob on that.

- Another key question is: Are the benchmark scores in line with the results above? I.e., do we tend to run 10x slower on the fp ones and 4x slower on the others? That would be great to know.

(Note: I intended for this to be a meta bug, and each experiment on this general topic would be a separate bug linking to it. See bug 642003 for a prior example of this bug organization. Don't worry about it here, I'll make a new meta bug and morph this one.)
Assignee: dmandelin → mrosenberg
Summary: [meta] Understand JS performance on ARM → Compare simple C kernels on ARM and i7
Blocks: 673000
(In reply to comment #3)
> - I'd be curious about a variant on Chris's idea: make a little integer
> kernel in C, test that, and then see if hand-hacking the ARM assembly can
> improve it. Maybe get help from Jacob on that.

Marty and I also chatted about this a little bit IRL -- if you make a micro-kernel with a single long (loop carried) dependency chain the issue width of your hardware won't matter and your reordering window won't help you at all. You can scale the number of simultaneous dependency chains up to see where the issue width chokes on the "less super" scalar pipelines.
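A hedged sketch of what that experiment could look like (names and constants are made up for illustration, not taken from the bug):

```c
/* One long loop-carried chain: issue width and the reorder window are
   useless here, so this measures raw FP latency. */
double one_chain(long iters) {
    double x = 2.0;
    for (long i = 0; i < iters; i++)
        x = x * 0.99 + 0.01;          /* each step depends on the previous */
    return x;                          /* converges toward 1.0 */
}

/* Four independent chains in the same loop: a superscalar, out-of-order
   core can keep several in flight, so throughput scales with issue
   width until the "less super" scalar pipeline chokes. */
double four_chains(long iters) {
    double a = 2.0, b = 3.0, c = 4.0, d = 5.0;
    for (long i = 0; i < iters; i++) {
        a = a * 0.99 + 0.01;
        b = b * 0.99 + 0.01;
        c = c * 0.99 + 0.01;
        d = d * 0.99 + 0.01;
    }
    return a + b + c + d;              /* converges toward 4.0 */
}
```

Timing one_chain against four_chains (at the same iteration count) would show roughly equal times on a wide out-of-order core and a ~4x difference on a core that cannot overlap the chains.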
(In reply to comment #3)
> - I'd be curious about a variant on Chris's idea: make a little integer
> kernel in C, test that, and then see if hand-hacking the ARM assembly can
> improve it. Maybe get help from Jacob on that.

GCC does a pretty good job on common cases, i.e. if you aren't doing
anything crazy or out on the boundaries of normal C. Still, it's always
worth having a look to check.

Consider A8 the lowest common denominator in terms of ARM floating-point
performance. A8's VFP is not pipelined. From its TRM:

"The VFP coprocessor is a nonpipelined floating-point execution engine
that can execute any VFPv3 data-processing instruction. Each instruction
runs to completion before the next instruction can issue, and there is
no forwarding of VFP results to other instructions."

You can find this at infocenter.arm.com ("Cortex-A series processors" ->
"Cortex-A8" -> "r3p2" -> "Instruction Cycle Timing" -> "VFP instructions").

A9's VFP is entirely different, and it's very fast. I suspect that the
likes of Qualcomm's Snapdragon (in several HTC phones) also have fast
VFP, though I've not measured them.

Also, bear in mind that even on FP-heavy JavaScript, the overall
percentage of VFP instructions involved is tiny compared to the loads
and other overheads that we have. VFP performance still has an impact,
but not as much as it would for C code.
(In reply to comment #5)
> (In reply to comment #3)
> > - I'd be curious about a variant on Chris's idea: make a little integer
> > kernel in C, test that, and then see if hand-hacking the ARM assembly can
> > improve it. Maybe get help from Jacob on that.
> 
> GCC does a pretty good job on common cases, i.e. if you aren't doing
> anything crazy and on the boundaries of normal C. Still, it's always
> worth having a look to check.
> 
To be fair, the version of GCC that I am using is kind of old (gcc-4.4), which,
IIRC, does not have their amazing loop-optimization library.

> Consider A8 the lowest common denominator in terms of ARM floating-point
> performance. A8's VFP is not pipelined. From its TRM:
This is the board that I've been using (Freescale MX51 Lange5.1 Board).

> Also, bear in mind that even on FP-heavy JavaScript, the overall
> percentage of VFP instructions involved is tiny compared to the loads
> and other overheads that we have. VFP performance still has an impact,
> but not as much as it would for C code.

This would be my guess; I'm going to take a look at what we generate for some numerically heavy code.

> I don't know what the C is, exactly, but I would guess that this test
> implicitly assumes that the compilation from x86 GCC (presumably with SSE 
> enabled? or is it just x87?) and ARM-with-VFP GCC is at parity, which I 
> seriously doubt.

I'll attach the C code in a bit.  I'm now compiling with -march=core2, which enables SSE2, SSE3, and SSSE3.
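For reference, the two kinds of invocation being compared would look roughly like this (the file names and the exact ARM tuning flags are illustrative, not taken from the bug):

```sh
# x86: -march=core2 enables SSE2/SSE3/SSSE3, as noted above
gcc -O3 -march=core2 kernels.c -o kernels-x86

# ARM: turn on VFP; the softfp ABI was typical for toolchains of this era
gcc -O3 -mfpu=vfp -mfloat-abi=softfp kernels.c -o kernels-arm
```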

> In order to get a solid comparison here I think comparing some FPU kernels 
> that can't be easily reordered (since all the ARM uprocs we'll have are 
> in-order) is a better test. Especially because in the JIT we hand craft the 
> code! :-)

I've added in two new tests, one that is basically a linear floating point test, and one that is basically a linear integer test.  The new results are:
x86:
double/mand:                     2.2143
double/logistic:                 2.0211
float/mand:                      2.2091
float/logistic:                  2.9499
int/fermat:                      1.3415
int/rand:                        2.0532
random_memory/merge_list_sort:   0.7088
arm:
double/mand:                    73.2609
double/logistic:                27.2753
float/mand:                     64.2513
float/logistic:                 29.5893
int/fermat:                     19.9106
int/rand:                       14.2950
random_memory/merge_list_sort:   5.9424

In this case, mand and logistic take about the same amount of time on x86, but mand is more than twice as slow as logistic on ARM. So, as we expected, x86 is much faster on code with fewer serial dependencies.
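To make the dependency structure concrete, here is a hypothetical sketch of the two shapes of kernel (not the attached code): the logistic map is one unbreakable serial chain, while Mandelbrot-style iteration over many points offers independent work that an out-of-order core can overlap.

```c
/* Logistic map: x_{n+1} = r * x_n * (1 - x_n). A single loop-carried
   chain; no reordering can hide the FP latency. */
double logistic(double x, long n) {
    for (long i = 0; i < n; i++)
        x = 3.5 * x * (1.0 - x);
    return x;
}

/* Mandelbrot escape-time count for one point c = cr + ci*i. Different
   points are independent of each other, so iterating a whole row of
   them gives the hardware multiple chains to keep in flight. */
int mandel_iters(double cr, double ci, int max) {
    double zr = 0.0, zi = 0.0;
    int i;
    for (i = 0; i < max && zr * zr + zi * zi <= 4.0; i++) {
        double t = zr * zr - zi * zi + cr;  /* z = z^2 + c */
        zi = 2.0 * zr * zi + ci;
        zr = t;
    }
    return i;
}
```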
Extract with:

  tar xzf foo.tgz
  cd hw; ./doit

If other people with working ARM devices running Linux could test it out and
post numbers for comparison, that would be useful.
># uname -a
>Darwin Toms-iPod-Touch 11.0.0 Darwin Kernel Version 11.0.0: Thu Feb 10 21:45:19 PST 2011; root:xnu-1735.46~2/RELEASE_ARM_S5L8922X iPod3,1 arm N18AP Darwin
># gcc --version
>gcc (GCC) 4.2.1 (Based on Apple Inc. build 5555)
>Copyright (C) 2007 Free Software Foundation, Inc.
>This is free software; see the source for copying conditions.  There is NO
>warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

random_memory/merge_list_sort had build failures, so I removed it.

double/mand:     104.6216
double/logistic:  36.9583
float/mand:       69.9893
float/logistic:   41.0557
int/fermat:       61.7327
int/rand:         20.3462

The bug assignee didn't log in to Bugzilla in the last 7 months, so the assignee is being reset.

Assignee: marty.rosenberg → nobody
Severity: normal → S3