Closed
Bug 443237
Opened 17 years ago
Closed 7 years ago
investigate int64 jsval slowdown on 32-bit x86
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
INVALID
mozilla1.9.1
People
(Reporter: brendan, Assigned: mohammad.r.haghighat)
Details
Attachments
(4 files)
9.90 KB, patch
12.74 KB, text/plain
13.03 KB, text/plain
2.87 KB, patch
The attached patch is fresh. If it has bugs, my apologies -- I hacked it up last week and IIRC it passed the JS testsuite, but you should reconfirm that. We run the suite by cd'ing to js/tests and typing
./jsDriver.pl -t -e smdebug -L lc2 lc3 spidermonkey-n.tests slow-n.tests
after building the debug ("smdebug") SpiderMonkey js shell in js/src via
make -f Makefile.ref
I built optimized via
make BUILD_OPT=1 OPTIMIZER=-Os -f Makefile.ref
on my MacBook Pro (since Apple's gcc does better with -Os than stock gcc, IIRC, I always build optimized this way) and saw ~27% slowdown in the js-shell-based SunSpider benchmark (from http://webkit.org/) when I enlarged jsval to int64 but kept jsval int domain restricted to 31 bits. When I enlarged ints to 32 bits I got back 1-2%.
Disappointing but not totally surprising to me. If there is an easy way to win back the lost perf, we are interested. If you guys can show why we lose perf, that would be helpful to guide future work. In any case, Moh kindly offered to help investigate this question, so I'm giving him the bug and patch.
/be
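For readers who have not seen the layout being tested, here is a minimal sketch of a tagged value word in both widths. The names (val32_t, val64_t, TAG_INT and the four helpers) are hypothetical, and the one-bit integer tag is an assumption made for illustration; the attached patch's actual macros may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical value words; these are not SpiderMonkey's typedefs. */
typedef uint32_t val32_t;
typedef uint64_t val64_t;

#define TAG_INT 0x1u  /* assumed: a set low bit marks an integer */

/* 32-bit word: one tag bit leaves a 31-bit signed integer payload. */
static val32_t int_to_val32(int32_t i) {
    assert(i >= -(1 << 30) && i < (1 << 30));  /* must fit in 31 bits */
    return ((uint32_t)i << 1) | TAG_INT;
}

static int32_t val32_to_int(val32_t v) {
    assert(v & TAG_INT);
    /* Relies on arithmetic right shift of signed values (true for gcc, clang, MSVC). */
    return (int32_t)v >> 1;
}

/* 64-bit word: the same tag bit now leaves room for a full 32-bit integer. */
static val64_t int_to_val64(int32_t i) {
    return ((uint64_t)(uint32_t)i << 1) | TAG_INT;
}

static int32_t val64_to_int(val64_t v) {
    assert(v & TAG_INT);
    return (int32_t)(uint32_t)(v >> 1);
}
```

The 64-bit word doubles the size of every value slot, which is the cost this bug asks to have investigated.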
Comment 1•17 years ago
The patch includes jsinterp.cpp and jsscope.cpp, while in the mozilla source tree these files are .c files. Is there any particular reason for this?
Thanks.
--
Carmen
Reporter
Comment 2•17 years ago
Hi Carmen, we're using
http://hg.mozilla.org/mozilla-central
now, cvs.mozilla.org is for Mozilla 1.9.0.x maintenance releases (Firefox 3.0.x, I think) only. For more on Mercurial, see
http://developer.mozilla.org/en/docs/Mercurial
/be
Comment 3•17 years ago
Hello,
I have built both the debug and optimized versions of the JS Shell using the original JS sources from Mozilla Central on Windows w/ the Visual C++ compiler.
I have changed a few files in order to get the JS Shell to build, namely changed jsinterp.c to jsinterp.cpp in the rules.mk file on line 99 and added an explicit cast to uint32* in js.cpp on line 858.
However, when I run jsDriver.pl I get the following errors:
1) Running jsDriver.pl w/ the debug, unoptimized, original version of Mozilla Central's JS Shell:
./jsDriver.pl -t -e smdebug -L lc2 lc3 spidermonkey-n.tests slow-n.tests
-*- executing: ./../src/WINNT5.1_DBG.OBJ/js.exe -f ./shell.js -f ./js1_5/shell.js -f ./js1_5/Regress/shell.js -f ./js1_5/Regress/regress-281487.js -f ./js-test-driver-end.js
An unhandled win 32 exception occurred in js.exe [1344]:
js.cpp, line 272 : Unhandled exception at 0x610b12f4 in js.exe: 0xC0000005: Access violation reading location 0x00045174.
jsDriver.pl resumed tests' execution after exiting the Visual Studio debugger.
FINAL OUTPUT:
-#- 22 test(s) failed
-------------------------------------------------------------------------------------------------------------------
2) Running jsDriver.pl w/ the optimized, original version of Mozilla Central's JS Shell:
./jsDriver.pl -t -e smopt -L lc2 lc3 spidermonkey-n.tests slow-n.tests
-*- executing: ./../src/WINNT5.1_OPT.OBJ/js.exe -f ./shell.js -f ./e4x/shell.js -f ./e4x/decompilation/shell.js -f ./e4x/decompilation/decompile-xml-escapes.js -f ./js-test-driver-end.js
An unhandled win 32 exception occurred in js.exe [6124]:
_file.c, line 238: Unhandled exception at 0x7c918fea in js.exe: 0xC0000005: Access violation writing location 0x00000010.
The error repeats for the subsequent tests. Note that this is the original code from mozilla.central. Also, in the debug version, this error occurs in js.cpp while in the optimized version, the error happens in _file.c.
Do you know what is causing this problem? Should I be testing this on iMac instead?
Thank you.
--
Carmen
Comment 4•17 years ago
Can you post a stack dump? I will look at the code in the meantime from the line number info.
Comment 5•17 years ago
This looks like a weird bug. A stack dump would help a lot. In the meantime I suggest you switch to Mac if possible. Brendan and I both use Mac, so that tends to be what we test against.
Comment 6•17 years ago
Here's the call stack for the first error(from Visual Studio):
js32.dll!61076ef4()
[Frames below may be incorrect and/or missing, no symbols loaded for js32.dll]
js32.dll!610853ff()
js32.dll!61076d88()
msvcr80.dll!78134c58()
js32.dll!610b12cd()
msvcr80.dll!78134c58()
js32.dll!6100e113()
js32.dll!61084d06()
msvcr80.dll!7813ee63()
js32.dll!61064205()
js32.dll!6104b601()
js32.dll!6108db72()
js32.dll!610a4718()
js32.dll!610a4718()
js32.dll!61015794()
js32.dll!6103b61a()
js32.dll!610369f3()
js32.dll!6102e19a()
js32.dll!6103b415()
js32.dll!6102e371()
ntdll.dll!7c910732()
ntdll.dll!7c910732()
ntdll.dll!7c911596()
js32.dll!6108ea39()
js32.dll!6108ea8a()
js32.dll!610a0d23()
js32.dll!6108e021()
js32.dll!6108dcdc()
js32.dll!610a4718()
js32.dll!6108db72()
js32.dll!610a4718()
js32.dll!6103b303()
js32.dll!6102e19a()
js32.dll!6102dd10()
ntdll.dll!7c910e91()
ntdll.dll!7c91056d()
js32.dll!610a4718()
ntdll.dll!7c91056d()
msvcr80.dll!78134c39()
msvcr80.dll!78134c58()
js32.dll!6103c059()
js32.dll!61015983()
js32.dll!61086043()
js32.dll!6108679b()
js32.dll!610867b1()
js32.dll!61062b59()
js32.dll!6101375d()
> js.exe!Process(JSContext * cx=0x00a0b398, JSObject * obj=0x00b11000, char * filename=0x00a05f26, int forceTTY=0x00000000) Line 272 + 0x16 bytes C++
js.exe!ProcessArgs(JSContext * cx=0x00a0b398, JSObject * obj=0x00b11000, char * * argv=0x00a05e84, int argc=0x0000000a) Line 500 + 0x19 bytes C++
js.exe!main(int argc=0x0000000a, char * * argv=0x00a05e84, char * * envp=0x00a041c0) Line 3940 + 0x15 bytes C++
js.exe!__tmainCRTStartup() Line 586 + 0x17 bytes C
kernel32.dll!7c816ff7()
Here's the call stack for the second error:
ntdll.dll!7c918fea()
[Frames below may be incorrect and/or missing, no symbols loaded for ntdll.dll]
ntdll.dll!7c915041()
ntdll.dll!7c915233()
ntdll.dll!7c9155c9()
ntdll.dll!7c90104b()
> js32.dll!_lock_file(_iobuf * pf=0x0041ed20) Line 238 C
js32.dll!getc(_iobuf * stream=0x0041ed20) Line 70 + 0x6 bytes C
js32.dll!_js_fgets() + 0x35 bytes C++
js32.dll!_js_GetToken() + 0x285e bytes C++
js32.dll!_js_GetToken() + 0x6f0 bytes C++
js32.dll!_js_PeekToken() + 0x35 bytes C++
js32.dll!_js_CompileScript() + 0x102 bytes C++
js32.dll!_JS_CompileFileHandleForPrincipals() + 0x37 bytes C++
js32.dll!_JS_CompileFileHandle() + 0x16 bytes C++
js.exe!_main() + 0x877 bytes C++
js.exe!_main() + 0x660 bytes C++
js.exe!_main() + 0x1a6 bytes C++
js.exe!__tmainCRTStartup() Line 318 + 0x12 bytes C
kernel32.dll!7c816ff7()
In the meantime,I'll build the JS shell on the iMac we have.
Thanks for your help,
Carmen
Comment 7•17 years ago
Hello,
I have built the debug and optimized versions of the original JS code from Mozilla Central on the iMac we have. So far I ran the tests using the jsDriver.pl for both versions and some tests failed. For the debug version, the following tests failed:
e4x/decompilation/decompile-xml-escapes.js
e4x/Expressions/11.1.4-08.js
e4x/Global/13.1.2.1.js
e4x/Namespace/regress-292863.js
e4x/TypeConversion/10.2.1.js
ecma/Math/15.8.2.6.js
ecma_3/RegExp/regress-311414.js
ecma_3/String/15.5.4.11.js
ecma_3/String/regress-392378.js
js1_5/extensions/regress-322957.js
js1_5/Regress/regress-320119.js
js1_7/geniter/regress-347739.js
js1_7/geniter/regress-349012-01.js
js1_7/geniter/regress-349331.js
js1_7/iterable/regress-340526-02.js
js1_7/lexical/regress-346642-03.js
js1_7/regress/regress-410649.js
The same tests, plus ecma/TypeConversion/9.3.1-3.js, failed for the optimized version.
Are these failures to be expected?
Thanks.
--
Carmen
Comment 8•17 years ago
Yeah, those look familiar at a quick scan. You can baseline the set of failures against the unpatched tree, and use that to verify the patch. (We should get those tests fixed or excluded promptly, IMO, but I haven't been pulling my weight there lately. :/)
Comment 9•17 years ago
Hello,
After building the debug and optimized versions of the patched JS code, the js-shell-based SunSpider benchmarks show a performance degradation of ~10% for the debug version and ~17% for the optimized version on the iMac I used.
However, many more tests fail for the jsDriver.pl, namely 91 for the debug version and 69 for the optimized version.
I will look into finding out why this is happening using Shark.
--
Carmen
Comment 10•17 years ago
The 91 failures are expected. The debug version contains a lot of debug/assertion overhead, so I think you can focus on the optimized build. For the non-failing cases, identifying the reason for the 17% slowdown would be very helpful. I am betting on register pressure (more L1 traffic), not memory bandwidth issues (L2 and misses), but let's see what Shark has to say.
Assignee
Comment 11•17 years ago
The bottleneck is of course more visible in the optimized build. Number of retired instructions is also a good starting metric.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Reporter
Comment 12•17 years ago
91 failures with the debug js shell is too many -- I got 17 the other day. Let me refresh the patch and test myself...
/be
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 13•17 years ago
We have some internal tools for diff'ing the performance characteristics of a binary on two workloads or two builds of the same source using different compiler options. I'm not sure if it is available on iMac. Will investigate and provide an update.
Comment 14•17 years ago
Is sunspider working with the patch? Or is that failing like the 91 test cases?
Comment 15•17 years ago
The shell-based Sunspider works with the patch on iMac.
Comment 16•17 years ago
On Win32, it turned out that the original debug version is built with the -MD option while the optimized version is not using this switch. The crash of the optimized version (still without the patch) would occur in the system fgetc when running js.exe on Sunspider. We tried the -MD option on the optimized build. The problem goes away and Sunspider runs to completion.
The optimized version with the patch still crashes even with the -MD option. The crash occurs in jsgc.cpp. On its build, however, there are a lot of warnings about potential loss of data in the jsval to uint32 conversion. These warnings are not generated on the Mac build. Are they safe to ignore?
Some of the warnings:
jsgc.cpp(2240) : warning C4244: 'argument' : conversion from 'jsval' to 'uint32', possible loss of data
jsgc.cpp(2555) : warning C4244: '=' : conversion from 'jsval' to 'uint32', possible loss of data
jsgc.cpp(2675) : warning C4244: 'argument' : conversion from 'jsval' to 'uint32', possible loss of data
jsgc.cpp(2682) : warning C4244: 'argument' : conversion from 'jsval' to 'uint32', possible loss of data
jsgc.cpp(2701) : warning C4244: 'argument' : conversion from 'jsval' to 'uint32', possible loss of data
jsgc.cpp(2703) : warning C4244: 'argument' : conversion from 'jsval' to 'uint32', possible loss of data
jsgc.cpp(2705) : warning C4244: 'argument' : conversion from 'jsval' to 'uint32', possible loss of data
jsgc.cpp(2747) : warning C4244: 'argument' : conversion from 'jsval' to 'uint32', possible loss of data
jsgc.cpp(2809) : warning C4244: 'argument' : conversion from 'jsval' to 'uint32', possible loss of data
jsgc.cpp(2816) : warning C4244: 'argument' : conversion from 'jsval' to 'uint32', possible loss of data
jsgc.cpp(2822) : warning C4244: 'argument' : conversion from 'jsval' to 'uint32', possible loss of data
jsgc.cpp(2851) : warning C4244: 'argument' : conversion from 'jsval' to 'uint32', possible loss of data
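As a rough illustration of what C4244 is pointing at, the sketch below shows when narrowing a 64-bit word to 32 bits drops information. The type and function names are hypothetical stand-ins, not code from jsgc.cpp.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t val64_t;  /* hypothetical stand-in for a 64-bit jsval */

/* Returns 1 if narrowing v to 32 bits keeps every set bit, 0 otherwise. */
static int narrows_losslessly(val64_t v) {
    return (uint64_t)(uint32_t)v == v;
}

/* Explicit narrowing that documents intent and is checked in debug builds. */
static uint32_t narrow_checked(val64_t v) {
    assert(narrows_losslessly(v));
    return (uint32_t)v;
}
```

Whether the flagged sites are safe depends on whether the upper 32 bits are always zero at each of them; the thread does not establish that, so an explicit, asserted cast like narrow_checked is one cautious way to find out.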
Assignee
Comment 17•17 years ago
On Win32, the debug build of the patched version dies on this assertion failure: Line 1768 of jsgc.cpp:
/* Try to get thing from the free list. */
thing = arenaList->freeList;
if (thing) {
arenaList->freeList = thing->next;
flagp = thing->flagp;
JS_ASSERT(*flagp & GCF_FINAL);
The value of *flagp is 0x48 and GCF_FINAL is 32:
0x48: 0010 1000
32: 0001 0000
and: 0000 0000
Assignee
Comment 18•17 years ago
(In reply to comment #17)
oops!
0x48: 0100 1000
32: 0010 0000
and: 0000 0000
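The same check can be written as a tiny C helper. This is only a sketch: FLAG_FINAL uses the value 32 reported above for GCF_FINAL, and flag_is_set is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define FLAG_FINAL 32u  /* value of GCF_FINAL as reported in this thread */

/* Returns 1 if any bit of mask is set in flags, 0 otherwise. */
static int flag_is_set(uint8_t flags, uint8_t mask) {
    assert(mask != 0);
    return (flags & mask) != 0;
}
```

With flags equal to 0x48 the helper returns 0, which matches the failing JS_ASSERT.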
Comment 19•17 years ago
Update:
We profiled the original and patched versions of the JS Shell with Shark on Mac OS X (there was a bit of trouble, as Shark will not show mixed source and assembly unless the JS Shell is built with the -gstabs+ option).
I'll be looking into why the slowdown happens next.
Still it would be good to have the Win32 version of the patch working as VTune is not supported on Mac OS X.
--
Carmen
Comment 20•17 years ago
Update:
I have looked at the generated code for Mac OS X for both the original and the patched versions. The performance degradation of the patch on the optimized build is consistently ~17%. According to Shark, the only function that shows a major sample increase is the js_interpret function (no surprise). For this function, the increase in the samples is ~4.3%. I manually inspected some of the hot spots, and the extra instructions are visible in the assembly code. So far, they have all been extra moves to take care of the 64-bit vs. 32-bit data movements. At first glance, these extra instructions do not seem to be spills/reloads.
I plan to do a more thorough analysis by breaking down the instructions. We have a Pin tool (our binary instrumentation tool) that can help here. Unfortunately, it is not supported on Leopard. It works on Linux & windows. But, the patch doesn't work on Windows. So, I'll try to see if I can get it to work on Linux. In that case, we would be able to have a better understanding of the instruction mix profile with and without the patch.
If this doesn't work, I'll try to write scripts to statically find the instruction mix from the compiler generated assembly listings.
If you have a suggestion, please let me know.
Thanks.
Comment 21•17 years ago
I can't be much help with Windows, but if you have build issues with Linux I can take a look and we should be able to get that to work easily. I am building tracemonkey on Mac and Linux now and it's mostly in sync with mozilla-central, minus the jsval64 patch.
Can you post an example of the code differences? No spills is definitely good news. I wonder what exactly happens in the code. Did you succeed in building the code on Mac in 64-bit mode (-m64)? I wonder what the code looks like with more and wider registers.
Comment 22•17 years ago
Hopefully there won't be any issues on Linux, but if there's any trouble I'll definitely ask you for help :).
Here is the assembly code generated for the original JS Shell version for line 5539 in jsinterp.cpp (PUSH_OPND(fp->vars[slot])):
mov esi, dword ptr [ebp-396]
mov ecx, dword ptr [ebp-76]
mov edx, dword ptr [esi+52]
mov eax, dword ptr [edx+eax*4]
mov dword ptr [ecx], eax
lea eax, dword ptr [ecx+4]
mov dword ptr [ebp-76], eax
and here's the code for the same line with the patched version of the JS Shell:
mov esi, dword ptr [ebp-740]
mov ecx, dword ptr [ebp-92]
mov edx, dword ptr [esi+56]
lea edx, dword ptr [edx+eax*8]
mov eax, dword ptr [edx]
mov edx, dword ptr [edx+4]
mov dword ptr [ecx], eax
lea eax, dword ptr [ecx+8]
mov dword ptr [ecx+4], edx
mov dword ptr [ebp-92], eax
We haven't tried building in 64-bit mode yet...
--
Carmen
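At the source level, the two listings above correspond to the same one-line push with a different slot width. The sketch below uses hypothetical helpers (push32, push64) instead of the interpreter's PUSH_OPND macro, purely to show where the extra moves come from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical operand stacks; sp points at the next free slot. */
static uint32_t *push32(uint32_t *sp, const uint32_t *vars, size_t slot) {
    assert(sp && vars);
    *sp = vars[slot];  /* one 4-byte load and one 4-byte store */
    return sp + 1;     /* stack pointer advances by 4 bytes */
}

static uint64_t *push64(uint64_t *sp, const uint64_t *vars, size_t slot) {
    assert(sp && vars);
    *sp = vars[slot];  /* on 32-bit x86 with integer moves: two loads, two stores */
    return sp + 1;     /* stack pointer advances by 8 bytes */
}
```

With only 32-bit general-purpose registers, the compiler has to split each 64-bit copy into a low half and a high half, which is the pattern visible in the patched listing.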
Comment 23•17 years ago
This is really disappointing. Looks like the loads & stores do not get fused and generate more bus traffic. I will hack up a microbenchmark to confirm this.
Comment 24•17 years ago
Looks like Brendan was right all along. Even comparing 32-bit data movement in 32-bit x86 mode against 64-bit data movement in 64-bit x86 mode, we tank performance by 18%. 64-bit data movements in 32-bit mode have 80% overhead. Not sure icc would produce different results. It's probably difficult to schedule this tight loop differently. Pretty disappointing. This might mean a dead end for 64-bit slots on x86.
64-bit x86 mode might be worth a second look. We might be able to recover the 18% overhead (big maybe though) from other wins (no GC for doubles, etc.).
#define I long
#define L long long
TT a[4096];
TT b[4096];
int main() {
int i, j;
for (i = 0; i < 500000; ++i)
for (j = 0; j < 4096; ++j)
a[j] = b[j];
}
h-233:tmp gal$ touch test.c && make CFLAGS="-O6 -DTT=I -m32" test
cc -O6 -DTT=I -m32 test.c -o test
h-233:tmp gal$ time ./test
real 0m1.671s
user 0m1.658s
sys 0m0.005s
h-233:tmp gal$ touch test.c && make CFLAGS="-O6 -DTT=I -m64" test
cc -O6 -DTT=I -m64 test.c -o test
h-233:tmp gal$ time ./test
real 0m1.981s
user 0m1.966s
sys 0m0.005s
h-233:tmp gal$ touch test.c && make CFLAGS="-O6 -DTT=L -m32" test
cc -O6 -DTT=L -m32 test.c -o test
h-233:tmp gal$ time ./test
real 0m2.944s
user 0m2.920s
sys 0m0.008s
h-233:tmp gal$ touch test.c && make CFLAGS="-O6 -DTT=L -m64" test
cc -O6 -DTT=L -m64 test.c -o test
h-233:tmp gal$ time ./test
real 0m1.981s
user 0m1.966s
sys 0m0.005s
similar results on linux x86_64 (core 2 duo)
dvander@mindknight:~/mozilla/tracemonkey/js/src$ cc test.c -O6 -DTT=I -m32 -otest
dvander@mindknight:~/mozilla/tracemonkey/js/src$ time ./test
real 0m0.968s
user 0m0.964s
sys 0m0.000s
dvander@mindknight:~/mozilla/tracemonkey/js/src$ cc test.c -O6 -DTT=I -m64 -otest
dvander@mindknight:~/mozilla/tracemonkey/js/src$ time ./test
real 0m2.168s
user 0m2.164s
sys 0m0.004s
dvander@mindknight:~/mozilla/tracemonkey/js/src$ cc test.c -O6 -DTT=L -m32 -otest
dvander@mindknight:~/mozilla/tracemonkey/js/src$ time ./test
real 0m3.078s
user 0m3.080s
sys 0m0.000s
dvander@mindknight:~/mozilla/tracemonkey/js/src$ cc test.c -O6 -DTT=L -m64 -otest
dvander@mindknight:~/mozilla/tracemonkey/js/src$ time ./test
real 0m2.140s
user 0m2.136s
sys 0m0.004s
Comment 26•17 years ago
Actually the 2nd test case is off. long is 64-bit in 64-bit mode. Performance for that probably matches the 1st case if we had used #define I int.
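One way to avoid the long pitfall is to use the fixed-width types from <stdint.h>, so the element size does not change with -m32 or -m64. The following is only a sketch of the copy kernel; ELEM_T, NELEMS and copy_loop are hypothetical names, and no timings are claimed for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef ELEM_T
#define ELEM_T int64_t  /* hypothetical knob: build with -DELEM_T=int32_t for the 32-bit case */
#endif

#define NELEMS 4096

static ELEM_T a[NELEMS];
static ELEM_T b[NELEMS];

/* Copies b into a `reps` times, mirroring the nested loops of the test above. */
static void copy_loop(long reps) {
    assert(reps >= 0);
    for (long i = 0; i < reps; ++i)
        for (size_t j = 0; j < NELEMS; ++j)
            a[j] = b[j];
}
```

A main that calls copy_loop(500000), run under time, would give the same shape of measurement as the runs in this thread.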
proc: Dual Core AMD Opteron(tm) Processor 170
dvander@hayate:~$ cc test.c -O6 -DTT=I -m32 -o test
dvander@hayate:~$ time ./test
real 0m5.547s
user 0m5.548s
sys 0m0.000s
dvander@hayate:~$ cc test.c -O6 -DTT=I -m64 -o test
dvander@hayate:~$ time ./test
real 0m5.744s
user 0m5.728s
sys 0m0.004s
dvander@hayate:~$ cc test.c -O6 -DTT=L -m32 -o test
dvander@hayate:~$ time ./test
real 0m11.795s
user 0m11.797s
sys 0m0.000s
dvander@hayate:~$ cc test.c -O6 -DTT=L -m64 -o test
dvander@hayate:~$ time ./test
real 0m8.490s
user 0m8.481s
sys 0m0.004s
Comment 28•17 years ago
David corrected long -> int for the 32-bit case. There is still serious slowdown for 64-bit data movement even in 64-bit mode and AMD and Intel seem to behave identically.
Assignee
Comment 29•17 years ago
Once we get the Pin dynamic instruction-mix profiles, we'll have a better understanding.
Reporter
Updated•17 years ago
Status: REOPENED → ASSIGNED
Comment 30•17 years ago
Update:
I've built the original and patched optimized versions of the JS Shell on a Linux machine. With the jsDriver.pl, 16 tests fail for the original JSShell and 70 tests fail for the patched JSShell.
I have also run the SunSpider benchmarks. On Linux, the patched optimized JSShell shows only a 9.2% performance degradation compared to the original optimized JSShell. This is a much better situation than on Mac, where the overhead was 17%.
I will see what further info we can get regarding dynamic instruction-mix profiles using Pin.
--
Carmen
Comment 31•17 years ago
Very interesting! I am not worried about the 70 fails for now.
Comment 32•17 years ago
Are you compiling with gcc or with icc? David is telling me gcc on Mac is from the last millennium. Linux+icc might be quite interesting if that works out of the box.
Assignee
Comment 33•17 years ago
This is still with gcc. The situation with icc could be different or similar. I think we'll do the instruction-mix study next, before trying icc.
Comment 34•17 years ago
Comment 35•17 years ago
Update:
I've uploaded the instruction mix results for both the original and the patched version on one run of Sunspider. We do have the results for all functions, but the output size was larger than what Bugzilla allowed. So, I include the results of only js_interpret and the total.
Here is the sorted list of the summary:
type                   original      patched  %increase
mem-write-4          1691677705   2167068651       28.1
stack-write          1420936462   1752317798       23.3
mem-read-4           2857687955   3224465488       12.8
stack-read           1893735049   2099248483       10.9
mem-read-1            398201620    399689215        0.4
mem-read-variable       2018520      2019241        0.0
mem-write-variable      2261468      2262255        0.0
mem-atomic                   48           48        0.0
mem-write-1            14588693     14574886       -0.1
mem-read-2             99033945     96835214       -2.2
mem-write-2            33464422     32001450       -4.4
mem-read-8             39821162     37870002       -4.9
mem-write-8            47275542     44860871       -5.1
TOTAL                9313636391  10742292707       15.3
The attached files show the number of executed instructions of each possible type. Meanwhile, I'll take a closer look, you're welcome to do so as well.
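For anyone recomputing the last column from the attached files, the %increase figures appear to be the relative change from the original count to the patched count. A minimal sketch, with pct_increase as a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Percent change from `original` to `patched`; inputs are dynamic instruction counts. */
static double pct_increase(uint64_t original, uint64_t patched) {
    assert(original != 0);
    return ((double)patched - (double)original) * 100.0 / (double)original;
}
```

A negative result means the patched build executed fewer instructions of that type, as in the mem-read-8 and mem-write-8 rows.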
Assignee
Comment 36•17 years ago
A comparative study of instruction mix of JSShell and TT would also be interesting.
Comment 37•17 years ago
Update:
I have also built the original and patched version of the JS Shell on iMac using the Intel C++ Compiler as follows:
- w/ 64-bit ICPC (BUILD_OPT=1 OPTIMIZER=-Os): the patched JS Shell shows a 1.7% improvement over the original JS Shell on the Sunspider benchmarks
This is encouraging news.
- w/ 32-bit ICPC (BUILD_OPT=1 OPTIMIZER=-O2): the patched JS Shell shows ~10% performance degradation over the original JS Shell on the Sunspider benchmarks
On iMac, out of the compilers I've used to build the original JS Shell (g++ -Os, 64-bit ICPC -Os and 32-bit ICPC -O2), the JS Shell built w/ the 32-bit ICPC does slightly better than the others on the Sunspider benchmarks.
All these results are without PGO.
Comment 38•17 years ago
What do you see on 32-bit with OPTIMIZER=-Os? We generally do better with -Os than with -O2, maybe that's not the case here?
Comment 39•17 years ago
With the 32-bit ICPC and the -Os option, the JS Shell crashes with a bus error.
Someone from the Intel compiler team is looking into this specific problem.
Comment 40•17 years ago
I used vprof to profile the Mozilla Central JS Shell on Sunspider benchmarks. Here are the execution counts of all the executed JS OPs. Any ideas on why there are so many "nops"? This is an optimized version: BUILD_OPT=1, OPTIMIZER=-Os.
Thanks,
Carmen
OP COUNT
------ --------
getvar 66437114
nop 35041756
getarg 16673490
setvar 7608591
add 5810054
lt 5520459
getelem 5127731
int8 5117817
name 4198866
one 3922465
call 3810889
bitand 3789341
return 3519948
mul 3032927
varinc 2984431
getgvar 2874105
setelem 2829807
setname 2450781
sub 2418990
rsh 2402669
callname 2101845
lsh 1725135
zero 1617453
uint16 1604418
pop 1577637
callvar 1515630
ifeq 1427404
callarg 1318400
goto 1172378
bindname 1052321
getprop 940707
forvar 750556
getvarprop 736472
le 676907
div 659412
popv 652447
uint24 650968
getthisprop 649996
gvarinc 608725
dup 603585
callprop 599996
eq 584141
gt 552905
dup2 501730
bitnot 433304
group 365130
FALSE 359971
neg 352021
incvar 344020
ge 340196
ifne 312143
getargprop 301731
getxprop 265970
not 263939
bitxor 258988
TRUE 258248
bitor 247140
stop 241568
this 218677
mod 217847
or 172517
double 161927
string 159337
null 158148
setprop 127892
initelem 87504
ne 70076
new 68861
enditer 58505
iter 58505
ursh 52617
setarg 45586
length 45117
incgvar 43215
endinit 38347
newinit 38347
vardec 32045
nameinc 31574
int32 12613
and 11681
argdec 9830
setgvar 9123
stricteq 7502
typeof 7502
strictne 7500
initprop 5037
regexp 4192
propinc 408
anonfunobj 226
deffun 137
forname 132
defvar 92
push 4
deflocalfun 1
lineno 1
Comment 41•17 years ago
(In reply to comment #39)
> With the 32-bit ICPC and the -Os option, the JS Shell crashes with a bus error.
> Someone from the Intel compiler team is looking into this specific problem.
>
The person from the Intel compiler team who was looking into this problem got back to me. He identified the problem to be the jsdhash.cpp file.
As such I have built the original and patched versions of the Mozilla Central JS Shell using the 32-bit Intel C++ compiler with the OPTIMIZER=-Os option and only the jsdhash.cpp file with the -O2 option (on iMac).
I have also built the original and patched JS Shell w/ PGO on jsinterp.cpp only.
I ran the Sunspider benchmarks using all the built versions of JS Shell. Here are the performance results:
BUILD                        ORIGINAL  PATCHED  %DIFF
ICPC32+O2                      2114.7     2270   7.34
ICPC32+Os*                     2164.3   2322.6   7.31
ICPC32+O2+PGO_jsinterp.cpp     2015.2   2130.1   5.70
ICPC32+Os*+PGO_jsinterp.cpp    2151.7     2295   6.66
*jsdhash.cpp built w/ -O2
Out of all the builds, the combination that has the smallest performance degradation is using ICC w/ -O2 & PGO and that is 5.7%. Moreover, this combination gives the best raw numbers as well.
--
Carmen
Reporter
Comment 42•17 years ago
(In reply to comment #40)
> I used vprof to profile the Mozilla Central JS Shell on Sunspider benchmarks.
> Here are the execution counts of all the executed JS OPs. Any ideas on why
> there are so many "nops"? This is an optimized version: BUILT_OPT=1,
> OPTIMIZER=-Os.
>
> Thanks,
> Carmen
>
> OP COUNT
> ------ --------
> getvar 66437114
> nop 35041756
How did you instrument these ops? Could you attach the patch you applied?
/be
Comment 43•17 years ago
(In reply to comment #42)
> (In reply to comment #40)
> > I used vprof to profile the Mozilla Central JS Shell on Sunspider benchmarks.
> > Here are the execution counts of all the executed JS OPs. Any ideas on why
> > there are so many "nops"? This is an optimized version: BUILT_OPT=1,
> > OPTIMIZER=-Os.
> >
> > Thanks,
> > Carmen
> >
> > OP COUNT
> > ------ --------
> > getvar 66437114
> > nop 35041756
>
>
> How did you instrument these ops? Could you attach the patch you applied?
>
> /be
>
After having vprof.cpp and vprof.h as part of the build, the only thing you need to do is redefine the BEGIN_CASE(OP) macro as follows:
#define BEGIN_CASE(OP) case OP: _nvprof(js_CodeName[op], 0);
But, in order to get the js_CodeName[op] right in an OPTIMIZED build, lines 108 and 118 in jsopcode.cpp need to be uncommented. These are the lines that guard the initialization of js_CodeName.
vprof.cpp and vprof.h are in the Tamarin Tracing distribution under the vprof directory.
I have done other changes also to my copy of mozillacentral, so a patch would show those up too. If it's really necessary, I can create a patch off another copy of mozillacentral.
--
Carmen
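To make the instrumentation idea concrete without the real sources, here is a self-contained sketch of a switch-loop interpreter whose BEGIN_CASE macro bumps a per-opcode counter. Everything in it is hypothetical (the three opcodes, COUNT_OP, op_counts, interpret); COUNT_OP merely stands in for the _nvprof call, and ADD_EMPTY_CASE is defined on top of BEGIN_CASE as an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical three-opcode instruction set, for illustration only. */
enum { OP_NOP, OP_INC, OP_STOP, OP_LIMIT };

static unsigned long op_counts[OP_LIMIT];

/* Stand-in for the profiling hook; the thread uses vprof's _nvprof here. */
#define COUNT_OP(op) (op_counts[(op)]++)

#define BEGIN_CASE(OP)     case OP: COUNT_OP(OP);
#define ADD_EMPTY_CASE(OP) BEGIN_CASE(OP)  /* assumed: empty cases reuse BEGIN_CASE */

/* Switch-in-a-loop interpreter; returns the accumulator when it reaches OP_STOP. */
static int interpret(const unsigned char *pc) {
    int acc = 0;
    assert(pc != NULL);
    for (;;) {
        switch (*pc++) {
          ADD_EMPTY_CASE(OP_NOP)
            break;
          BEGIN_CASE(OP_INC)
            ++acc;
            break;
          BEGIN_CASE(OP_STOP)
            return acc;
          default:
            assert(!"unknown opcode");
            return -1;
        }
    }
}
```

Because the empty case goes through the same macro, every executed no-op is counted just like any other opcode, so a large nop count reflects how often the bytecode contains them rather than a flaw in the counting.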
Comment 44•17 years ago
vprof.cpp and vprof.h are available in the tracemonkey repository
Reporter
Comment 45•17 years ago
The threaded interpreter case handles JSOP_NOP with ADD_EMPTY_CASE which in turn involves BEGIN_CASE, but you show a case OP: label, which says you are using the non-threaded (switch in a for (;;) loop) version -- which does not layer empty cases such as JSOP_NOP on top of BEGIN_CASE. Did you modify the !JS_THREADED_INTERP definition of ADD_EMPTY_CASE too?
Again a patch, even by email, would help me render help here most quickly.
/be
Reporter
Comment 46•17 years ago
Oops, I'm wrong: ADD_EMPTY_CASE is layered on BEGIN_CASE in both versions, threaded and switch-loop. Still, I find the NOP count hard to understand. So if you could mail me the patch, that would be ideal. Thanks,
/be
Comment 47•17 years ago
Ok, I'll get a fresh copy, apply my changes and create a patch.
FYI, the vproffed JS Shell results were collected on Windows.
--
Carmen
Comment 48•17 years ago
The attached patch enables the collecting of execution counts for each executed JS opcode.
I have built the vproffed JS Shell (from Mozilla Central) on Windows w/ the Visual 2008 C++ compiler as follows:
make BUILD_OPT=1 OPTIMIZER="-Os -MD" -f Makefile.ref
--
Carmen
Comment 49•7 years ago
The value representation has changed since this bug was filed.
Status: ASSIGNED → RESOLVED
Closed: 17 years ago → 7 years ago
Resolution: --- → INVALID