Bug 671957 - Compare simple C kernels on ARM and i7
Status: NEW (open)
Opened 13 years ago; updated 2 years ago
Component: Core :: JavaScript Engine (defect)
Reporter: dmandelin; Unassigned
References: blocks 1 open bug
Attachments: 1 file (2.96 KB, application/x-gzip)
Description

Stuart told me it would be awesome if websites could work on mobile just as well as on desktop. For JS, this basically means performance. I'm not sure yet how to think about time and space tradeoffs, but for now maybe we can think of it as how much speed we can get at an acceptable space cost. There are three main questions here:

1. What is the current state of JS performance on mobile ARM devices compared to desktop?
2. How close can we get JS performance on ARM to desktop?
3. What do we have to do to get that close?
Comment 1 • 13 years ago

So I decided to see how various aspects of the hardware compare between x86 and ARM. With a rather simplistic test, I got these numbers (in seconds per run):

800 MHz ARM:
  double: 73.965850
  single: 64.780529
  integer: 20.124182
  random memory access: 5.966641

2300 MHz Core i7 (about 4 times the clock speed):
  double: 5.851761
  single: 5.861845
  int: 4.758156
  random memory access: 1.510734

These tests were written in C, compiled with -O3, with VFP enabled on ARM. It seems the ARM floating-point code runs at less than a tenth of the speed of the x86 floating-point code, while ARM's integer code and memory accesses both roughly keep up with x86. Unfortunately, this indicates that floating-point-heavy code will have lots of trouble keeping up. When I get some newer hardware, I'll test with that.
Comment 2 • 13 years ago

(In reply to comment #1)
I don't know what the C is, exactly, but I would guess that this test implicitly assumes that compilation by x86 GCC (presumably with SSE enabled? or is it just x87?) and by ARM-with-VFP GCC is at parity, which I seriously doubt. To get a solid comparison here, I think comparing some FPU kernels that can't be easily reordered (since all the ARM microprocessors we'll have are in-order) is a better test. Especially because in the JIT we hand-craft the code! :-)
Reporter
Comment 3 • 13 years ago

(In reply to comment #1)
> It seems like the arm floating point code is running at less than a tenth of
> the x86 floating point code. Arm's integer and the memory accesses both
> seem to be keeping up with x86. Unfortunately, this seems to indicate that
> code that is floating point heavy will have lots of trouble keeping up.
> When I get some newer hardware, I'll test with that.

Nice! This is exactly the kind of basic info we need.

I think the i7 is only 3x the clock speed in this case, right? For int and memory, you measured a 4x difference. It seems like that could be due to things like 'width' (is that the right term? I mean things like fetch bandwidth and number of functional units), reordering, and/or memory bus speed.

- I'd be curious about a variant on Chris's idea: make a little integer kernel in C, test that, and then see if hand-hacking the ARM assembly can improve it. Maybe get help from Jacob on that.

- Another key question: are the benchmark scores in line with the results above? I.e., do we tend to run 10x slower on the FP tests and 4x slower on the others? That would be great to know.

(Note: I intended for this to be a meta bug, with each experiment on this general topic filed as a separate bug linking to it. See bug 642003 for a prior example of this bug organization. Don't worry about it here; I'll file a new meta bug and morph this one.)
Reporter
Updated • 13 years ago
Assignee: dmandelin → mrosenberg
Summary: [meta] Understand JS performance on ARM → Compare simple C kernels on ARM and i7
Comment 4 • 13 years ago

(In reply to comment #3)
> - I'd be curious about a variant on Chris's idea: make a little integer
> kernel in C, test that, and then see if hand-hacking the ARM assembly can
> improve it. Maybe get help from Jacob on that.

Marty and I also chatted about this a little bit in person: if you make a micro-kernel with a single long (loop-carried) dependency chain, the issue width of your hardware won't matter and your reordering window won't help you at all. You can then scale the number of simultaneous dependency chains up to see where the issue width chokes on the "less super" scalar pipelines.
Comment 5 • 13 years ago

(In reply to comment #3)
> - I'd be curious about a variant on Chris's idea: make a little integer
> kernel in C, test that, and then see if hand-hacking the ARM assembly can
> improve it. Maybe get help from Jacob on that.

GCC does a pretty good job on common cases, i.e. if you aren't doing anything crazy or at the boundaries of normal C. Still, it's always worth having a look to check.

Consider the A8 the lowest common denominator in terms of ARM floating-point performance: its VFP is not pipelined. From its TRM: "The VFP coprocessor is a nonpipelined floating-point execution engine that can execute any VFPv3 data-processing instruction. Each instruction runs to completion before the next instruction can issue, and there is no forwarding of VFP results to other instructions." You can find this at infocenter.arm.com ("Cortex-A series processors" -> "Cortex-A8" -> "r3p2" -> "Instruction Cycle Timing" -> "VFP instructions").

The A9's VFP is entirely different, and it's very fast. I suspect that the likes of Qualcomm's Snapdragon (in several HTC phones) also have fast VFP, though I've not measured them.

Also, bear in mind that even on FP-heavy JavaScript, the overall percentage of VFP instructions involved is tiny compared to the loads and other overheads that we have. VFP performance still has an impact, but not as much as it would for C code.
Comment 6 • 13 years ago

(In reply to comment #5)
> GCC does a pretty good job on common cases, i.e. if you aren't doing
> anything crazy and on the boundaries of normal C. Still, it's always
> worth having a look to check.

To be fair, the version of GCC I'm using is kind of old (gcc-4.4), which, IIRC, does not have their amazing loop optimization library.

> Consider A8 the lowest common denominator in terms of ARM floating-point
> performance. A8's VFP is not pipelined.

This is the board I've been using (Freescale MX51 Lange5.1 board).

> Also, bear in mind that even on FP-heavy JavaScript, the overall
> percentage of VFP instructions involved is tiny compared to the loads
> and other overheads that we have. VFP performance still has an impact,
> but not as much as it would for C code.

This would be my guess too; I'm going to take a look at what we generate for some numerically heavy code.

(In reply to comment #2)
> I don't know what the C is, exactly, but I would guess that this test
> implicitly assumes that the compilation from x86 GCC (presumably with SSE
> enabled? or is it just x87?) and ARM-with-VFP GCC is at parity, which I
> seriously doubt.

I'll attach the C code in a bit. I'm now compiling with -march=core2, which enables SSE2, SSE3, and SSSE3.

> In order to get a solid comparison here I think comparing some FPU kernels
> that can't be easily reordered (since all the ARM uprocs we'll have are
> in-order) is a better test.

I've added two new tests: one that is basically a linear floating-point test, and one that is basically a linear integer test.

The new results are:

x86:
  double/mand: 2.2143
  double/logistic: 2.0211
  float/mand: 2.2091
  float/logistic: 2.9499
  int/fermat: 1.3415
  int/rand: 2.0532
  random_memory/merge_list_sort: 0.7088

arm:
  double/mand: 73.2609
  double/logistic: 27.2753
  float/mand: 64.2513
  float/logistic: 29.5893
  int/fermat: 19.9106
  int/rand: 14.2950
  random_memory/merge_list_sort: 5.9424

Here mand and logistic take about the same time on x86, but mand is more than twice as slow as logistic on ARM, so, as we expected, x86 handles code with fewer dependencies much faster.
Comment 7 • 13 years ago

Extract with "tar xzf foo.tgz", then "cd hw; ./doit". If other people with working ARM devices running Linux could test it out and post their numbers for comparison, that would be good.
Comment 8 • 13 years ago

> # uname -a
> Darwin Toms-iPod-Touch 11.0.0 Darwin Kernel Version 11.0.0: Thu Feb 10 21:45:19 PST 2011; root:xnu-1735.46~2/RELEASE_ARM_S5L8922X iPod3,1 arm N18AP Darwin
> # gcc --version
> gcc (GCC) 4.2.1 (Based on Apple Inc. build 5555)
> Copyright (C) 2007 Free Software Foundation, Inc.
> This is free software; see the source for copying conditions. There is NO
> warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

random_memory/merge_list_sort had build failures, so I removed it.
double/mand: 104.6216
double/logistic: 36.9583
float/mand: 69.9893
float/logistic: 41.0557
int/fermat: 61.7327
int/rand: 20.3462
Comment 9 • 2 years ago

The bug assignee hasn't logged into Bugzilla in the last 7 months, so the assignee is being reset.
Assignee: marty.rosenberg → nobody
Updated • 2 years ago
Severity: normal → S3