Bug 881537: Poor performance of asmjs when compiling Gambit Scheme interpreter
Status: CLOSED, RESOLVED INCOMPLETE (opened 12 years ago, closed 2 years ago)
Category: Core :: JavaScript Engine (defect)
People: Reporter: feeley, Unassigned
Attachments: 4 files
Compiling Gambit Scheme with emscripten yields poor performance when asmjs is enabled. I have attached two versions of the self-contained benchmark (to be run with the js shell): gambit-asmjs-bench.js is the version with asmjs enabled, and gambit-noasmjs-bench.js is the version with asmjs disabled.
The program computes 1000 digits of pi, iterating 4 times. Here are the run times of the last iteration:
IonMonkey:
1.540 seconds gambit-asmjs-bench.js
0.668 seconds gambit-noasmjs-bench.js
V8:
1.021 seconds (same performance for both versions)
So IonMonkey is more than twice as fast (about 2.3x) when asmjs is not used.
Comment 1•12 years ago (Reporter)
Comment 2•12 years ago
May benefit from the ffi optimizations that are in progress.
Depends on: 860838
Comment 3•12 years ago
I also noticed it takes a ton of time to compile with asm.js.
Comment 4•12 years ago (Reporter)
Yes, although I'm not too worried about that right now. But it is worrisome that an optimized program runs much slower than an unoptimized one. If the style of asmjs code generated for Gambit is not optimizable, the compiler should detect this and fall back to not optimizing it.
Comment 5•12 years ago (Reporter)
I should also add that the same Scheme program (computing pi to 1000 digits) when run with the Gambit Scheme interpreter compiled with gcc (natively) only takes 0.004 seconds run time. This means the natively compiled interpreter is close to 400 times faster than when running on emscripten/asmjs/FF. Some of the slowdown can be attributed to emscripten not dealing well with the indirect gotos in the code. But still, there must be something else happening to account for 400x.
Comment 6•12 years ago
Ignoring the load time problem, I just tried to run (with a mozilla-inbound tip build) and I see a validation error:
gambit-asmjs-bench.js:3745:23403 warning: asm.js type error: int is not a subtype of double
Perhaps you could try using the latest Emscripten?
Comment 7•12 years ago (Reporter)
Comment 8•12 years ago (Reporter)
I am using the latest emscripten (the master from github) and the latest Gambit.
The script I use to build gsi.js is attached. Running the script will clone the latest emscripten and Gambit sources, bootstrap Gambit, and compile the Gambit interpreter using emscripten. The result will be in the file gambit/emscripten-gsi/gsi.js. So if you want to reproduce the steps on your own, and perhaps tweak some of the compilation options, feel free to try.
Comment 9•12 years ago
Here's one hot function, gcd. It may be interesting that there are a lot of redundant variables being used. For example $24, $31, $34, $37, $42, $47, $53, $57, and more, are all constant and initialized to the same value. Many of them are only used once, and are just offsets from $___ps that could have been moved to their use site. The macro-expanded C source for this might help analyse it further. Is clang really doing such a poor job, or is the LLVM output not being translated well?
The generated code has large 'group move' operations shuffling around these variables, and this would seem to be unnecessary and might be something to look into. The _trampoline() function is also hot, and perhaps this needs to be very well optimized for this code.
Note that the gcd function alone takes a long time to compile, but is not that big. Perhaps this will be resolved with the new parser. Can a parser really be that slow? Might the parser be doing more than necessary? Is it possible to just get an AST quickly that could be used for asm.js?
Comment 10•12 years ago (Reporter)
Please note that the build-gambit-emscripten script that I attached uses "-s ASM_JS=0". If you want to generate asmjs code, please set it to 1.
Comment 11•12 years ago
The problem in that gcd function is that it is not relooped. It uses indirect jumps - which Marc tells me is common in Gambit - which prevent emscripten from relooping the function. So we end up with the while-switch pattern, and emscripten currently cannot perform variable elimination on it very efficiently; we only do so inside basic blocks, that is, within one case statement. Variables spanning case statements will not be eliminated.
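For concreteness, here is a minimal sketch of the while-switch dispatch shape being described. The block numbers, variable names, and block bodies are made up for illustration, and it is written in C for brevity; the actual emscripten output in question is JavaScript and far larger.

/* Minimal sketch of the while-switch dispatch shape (illustrative block
   numbers and variable names; the real emscripten output is JavaScript). */
#include <stdio.h>

static int run(int pc) {
    int acc = 0;
    while (1) {
        switch (pc) {
        case 0:                /* basic block 0 */
            acc += 1;
            pc = 2;            /* successor reached via the dispatch loop */
            break;
        case 1:                /* basic block 1 */
            return acc * 3;
        case 2:                /* basic block 2 */
            acc += 10;
            pc = 1;
            break;
        default:
            return -1;
        }
        /* acc is live across case statements, so per-case (basic-block)
           variable elimination cannot remove it */
    }
}

int main(void) {
    printf("%d\n", run(0));    /* prints 33 */
    return 0;
}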
Comment 12•12 years ago (Reporter)
Here's a comparison with the native clang and gcc compilers on OS X, using -O2, and with either multiple_hosts mode (as was done for emscripten) or single_host mode, which typically gives faster code but can take very long to compile due to the huge C functions (one per Scheme module), particularly with older versions of clang. The run times for 1000 and 10000 digits of pi are:
native clang 3.2 + multiple_hosts: 5 ms / 68 ms
native clang 3.2 + single_host: 3 ms / 51 ms
native gcc 4.8 + multiple_hosts: 4 ms / 65 ms
native gcc 4.8 + single_host: 3 ms / 52 ms
So there's not much of a difference between native clang and gcc on this benchmark. On other benchmarks (with frequent function calls) there is a considerable difference in favor of gcc. But here control seems to mainly stay within the gcd function.
Now with the combination clang 3.2 + emscripten + asmjs + FF nightly:
clang 3.2 emscripten asmjs FF + multiple_hosts: 1000 ms / 13000 ms
So that's 200 to 250 times slower than native clang.
Comment 13•12 years ago (Reporter)
(In reply to Alon Zakai (:azakai) from comment #11)
> The problem in that gcd function is that it is not relooped. It uses
> indirect jumps - which Marc tells me is common in Gambit - which prevent
> emscripten from relooping the function. So we end up with the while-switch
> pattern, and emscripten currently cannot perform variable elimination on it
> very efficiently, we only do so inside basic blocks, that is, one case
> statement. Variables spanning case statements will not be eliminated.
Let me clarify. In one of the compilation modes of Gambit, each basic block is prefixed by a C label, and control flows between basic blocks either with a "goto label" or "if (some_condition) goto *(ptr->label); else goto return_to_trampoline;". However, that compilation mode was not used in the test program because clang's implementation of "goto*" is buggy. So instead the following style of code is used:
start = <int_expression_corresponding_to_first_label>;
jump: switch (pc -= start) {
case 0: L0: ...; // code at first label
case 1: L1: ...; // etc
case 2: L2: ...;
}
and a jump to another basic block is either
goto Ln; // when destination is known statically
or
pc = <destination_expression>; goto jump;
Note that the "jump: switch ..." is only used for the control flow when the destination basic block is dynamically known (for implementing Scheme function returns and dynamic calls).
I guess that style of code also can't be relooped by emscripten. I just wanted to clarify that it wasn't due to a computed goto, i.e. "goto *dest".
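For reference, here is a minimal sketch of the label-values mode described at the start of this comment, using the GNU C "&&label" / "goto *" extension. The label names, values, and block bodies are illustrative only, not Gambit's actual macros or generated code.

/* Sketch of the label-values ("goto *") dispatch mode (GNU C extension);
   labels and block bodies are illustrative, not Gambit's actual output. */
#include <stdio.h>

static int run(void) {
    void *ret;                 /* dynamically known destination */
    int acc = 0;

    acc += 1;
    ret = &&L2;                /* e.g. a recorded return point */
    goto L1;                   /* statically known destination: goto label */

L1:
    acc += 10;
    goto *ret;                 /* dynamically known destination: goto *dest */

L2:
    return acc * 3;
}

int main(void) {
    printf("%d\n", run());     /* prints 33 */
    return 0;
}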
Comment 14•12 years ago
Ah, I can see how unrelooped code would be incredibly slower than native. That leaves the question of why asm.js is slower than non-asm.js. Given that non-asm.js will definitely put this one hot function in the baseline jit, I think the difference we are seeing is Ion vs. Baseline speed. While this is all pathological territory, it'd be good to look into this with VTune and see if we are doing something bad for, e.g., the while(1)switch(label) pattern.
Comment 15•12 years ago (Reporter)
If instead of the "jump: switch ..." style Gambit were to generate the style
for (;;) switch (pc -= start) {...}
and it replaced "goto Ln" by "pc = start+n; continue;", would that be handled better by emscripten/asmjs?
Comment 16•12 years ago
Actually both should be relooped by emscripten. It is only setjmp and indirect branches (not statically known) that can foil relooping. So one suspicion is that LLVM is generating indirect branches here as an optimization. We can look in the LLVM IR to check, if you want send me the bitcode and name of the relevant function.
Comment 17•12 years ago (Reporter)
(In reply to Luke Wagner [:luke] from comment #14)
> Ah, I can see how unrelooped code would be incredibly slower than native.
> That leaves the question of why asm.js is slower than non-asm.js. Given
> that non-asm.js will definitely put this one hot function in the baseline
> jit, I think the difference we are seeing is Ion vs. Baseline speed. While
> this is all pathological territory, it'd be good to look into this with
> VTune and see if we are doing something bad for, e.g., the
> while(1)switch(label) pattern.
I don't see how a factor of 250x can be explained by failure to reloop. Maybe a factor of 2x, 5x or 10x. But 250x is a lot of x-es.
As far as I understand this, the relooper fails to rewrite "goto Ln" into a JS control flow statement (while, if, etc). Instead it generates a "pc = n; break;" which ends up at a "switch (pc) { ... case n: ... }".
In the worst case, where the test program would be doing "goto Ln" all the time and nothing else, it would mean that the "goto Ln" is 250 times faster than going through a switch. That would be a mighty slow implementation of switch. Perhaps the switch is implemented with a cascade of ifs?
Comment 18•12 years ago (Reporter)
(In reply to Alon Zakai (:azakai) from comment #16)
> Actually both should be relooped by emscripten. It is only setjmp and
> indirect branches (not statically known) that can foil relooping. So one
> suspicion is that LLVM is generating indirect branches here as an
> optimization. We can look in the LLVM IR to check, if you want send me the
> bitcode and name of the relevant function.
I suggest you use the attached script to recompile the test program. That way you can experiment locally. If that's not possible I will send you the bitcode file.
Comment 19•12 years ago
Looking in the IR, that H__23__23_gcd function has lots of indirect branches, even without LLVM optimizations. The last block has a huge indirectbr with dozens of options. Perhaps clang emits indirect branches when it sees a certain pattern.
How would I find the C source code that was compiled into that?
Comment 20•12 years ago (Reporter)
The function is in gambit/lib/_num.c .
You'll need to pass the code to the C preprocessor if you want to read it:
% cd gambit/lib
% clang -E -I ../include/gambit.h _num.c
Comment 21•12 years ago (Reporter)
You might want to "indent" the output. Look for the C function ___H__23__23_gcd .
Comment 22•12 years ago (Reporter)
Sorry, it should be:
clang -E -I ../include -D___NOT_USE_LABEL_VALUES=1 _num.c
Otherwise you'll get the computed gotos.
Comment 23•12 years ago (Reporter)
Just to clarify, the attached build-gambit-emscripten script will use computed gotos and it does not use asmjs (because of the -s ASM_JS=0). After changing to -s ASM_JS=1, the use of computed gotos can be removed by adding a -D___NOT_USE_LABEL_VALUES=1 at the end of the GAMBIT_CFLAGS at the top of the script.
I have tried it with and without computed gotos:
with "jump: switch (pc -= start) {..." : 1.4 seconds
with "goto *dest": 0.8 seconds
Note that I have updated my FF nightly since the original experiment, which might explain the variation in run time.
So the computed gotos are helping quite a bit.
Comment 24•12 years ago (Reporter)
Hmmm. I was trying to understand why the run times for the test program vary between experiments. I have noticed that the run times are very consistent when they are run in FF with the profiler turned on (0.77 seconds +/- 0.03), and unstable when running with the profiler turned off (0.90 seconds +/- 0.25). The average run time is highest when the profiler is *off*, which is counterintuitive given the extra signals and accounting that have to happen when profiling. This makes me question the validity of the profiling. Why does the run time vary so much (about +/- 30%) when profiling is off and why is it *higher* than when profiling is on?
Comment 25•12 years ago
With small amounts of C code I cannot reproduce the automatic creation of indirect branches that LLVM does.
I also tried compiling that function standalone, by hacking it a little to get rid of globals etc. But then it did not use indirectbrs, it was normal branches and it relooped.
So the question is what gets clang to generate indirectbrs here. What is the specific command we use to compile that source file? Is some special gcc/clang option being used to optimize interpreter-loop style code? Or perhaps some __attribute__ in a header affects this?
Or it might be that above a certain size of jump table or a certain number of crisscrossing gotos, it decides to rewrite to indirect branches.
Comment 26•12 years ago (Reporter)
(In reply to Alon Zakai (:azakai) from comment #25)
> With small amounts of C code I cannot reproduce the automatic creation of
> indirect branches that LLVM does.
>
> I also tried compiling that function standalone, by hacking it a little to
> get rid of globals etc. But then it did not use indirectbrs, it was normal
> branches and it relooped.
>
> So the question is what gets clang to generate indirectbrs here. What is the
> specific command we use to compile that source file? Is some special
> gcc/clang option being used to optimize interpreter-loop style code? Or
> perhaps some __attribute__ in a header affects this?
Are you sure you used -D___NOT_USE_LABEL_VALUES=1? Otherwise gambit.h will expand the virtual instruction macros in _num.c to use computed gotos.
Comment 27•12 years ago
Yes, I ran
$ ~/Dev/clang+llvm-3.2-x86-linux-ubuntu-12.04/bin/clang -E -I ../include -D___NOT_USE_LABEL_VALUES=1 _num.c > a.c
and I got
http://www.pastebin.mozilla.org/2512777
Comment 28•12 years ago
I attempted to profile the code from Comment 0 in VTune, but the call to js_AtomToPrintableString() in SendFunctionsToVTune() fails with OOM.
Comment 29•12 years ago (Reporter)
You mean the code is too big, causing an OOM error when it is sent to VTune? That's strange. 11 MB is a lot of code, but it shouldn't overflow memory. Is the whole code of a file being sent, or just one function?
Comment 30•12 years ago
(In reply to Marc Feeley from comment #29)
> You mean the code is too big causing an OOM error when it is sent to VTune?
Surprisingly, it's just that (GC-capable!) string generation function mentioned in Comment 28. I'll try again tomorrow -- it might work to remove function name tracking and import the source line number information patch from Bug 837128.
Comment 31•12 years ago
ahem, "from Bug 873128".
Comment 32•12 years ago (Reporter)
(In reply to Alon Zakai (:azakai) from comment #25)
> With small amounts of C code I cannot reproduce the automatic creation of
> indirect branches that LLVM does.
>
> I also tried compiling that function standalone, by hacking it a little to
> get rid of globals etc. But then it did not use indirectbrs, it was normal
> branches and it relooped.
>
> So the question is what gets clang to generate indirectbrs here. What is the
> specific command we use to compile that source file? Is some special
> gcc/clang option being used to optimize interpreter-loop style code? Or
> perhaps some __attribute__ in a header affects this?
>
> Or it might be that above a certain size of jump table or a certain number
> of crisscrossing gotos, it decides to rewrite to indirect branches.
I think you are right if the following paper is still valid:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.9195&rep=rep1&type=pdf
According to that paper a switch is translated by LLVM using one of several strategies depending on the number of cases (either linear branch cascade, balanced binary search, jump tables, etc).
I did a test with a smaller test program (fib) and clang+emscripten+asmjs fares much better... the slowdown factor from native clang is about 5x, which is much more reasonable than the 250x we are getting for the 1000 digits of pi test program (which spends most of its time in gcd, which has a large switch). There are only 4 or 5 switch cases in the fib function, so indirect branches through a jump table are probably avoided in favor of conditional branches, and this allows emscripten to do the relooping.
Doesn't this mean that emscripten will fare poorly for functions containing switch statements with a large number of cases? It would be interesting to determine the threshold at which LLVM starts using a jump table.
The fib test was done with a js shell that is a few days old. When I repeated the experiment with the current js shell I get:
fib.js:3675:34019 warning: asm.js type error: int is not a subtype of double
and the program runs 7x slower (so 35x slower than native clang) because it couldn't use asmjs. I will file a separate bug for this.
Comment 33•12 years ago (Reporter)
(In reply to Marc Feeley from comment #32)
> Doesn't this mean that emscripten will fare poorly for functions containing
> switch statements with a large number of cases? It would be interesting to
> determine what is the threshold where llvm starts using a jump table.
From the file lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp in LLVM:
// If the switch has more than N blocks, and is at least 40% dense, and the
// target supports indirect branches, then emit a jump table rather than
// lowering the switch to a binary tree of conditional branches.
// N defaults to 4 and is controlled via TLS.getMinimumJumpTableEntries().
The MinimumJumpTableEntries is not changed from the default anywhere else in LLVM (except for the Hexagon CPU).
So does this mean that functions containing a dense switch statement with 5 cases or more are not going to be relooped by emscripten? That would be pretty sad.
Comment 34•12 years ago (Reporter)
I have created another test which measures the performance of functions containing an increasing number of switch cases. I took fib and made a few copies nested inside a toplevel Scheme function. So fib1 has 1 copy of recursive fib, fib2 has 2 copies, etc. I then measured the run time of each version. Here are the results for asmjs with the ratio compared to native clang (which takes 0.090 seconds regardless of the number of switch cases):
fib0 : 8 cases, 0.404 seconds (4.5x)
fib1 : 11 cases, 0.466 seconds (5.2x)
fib2 : 17 cases, 0.929 seconds (10.3x)
fib3 : 23 cases, 1.312 seconds (14.6x)
fib4 : 29 cases, 1.708 seconds (19.0x)
fib9 : 59 cases, 4.463 seconds (49.6x)
fib30: 243 cases, 31.494 seconds (350.0x)
What is surprising in this experiment is that the ratio between asmjs and native clang increases with the number of switch cases. Could it be that the asmjs switch is not using a jump table?
For reference the Scheme function gcd has 240 switch cases. This could explain the 250x slowdown of gcd.
Comment 35•12 years ago
Odin unconditionally generates an MTableSwitch for switch statements (and MTableSwitch is unconditionally a real table dispatch), so I don't think that can be the issue. Have you inspected the generated JS to see that there is actually a switch statement?
Comment 36•12 years ago (Reporter)
Oh my god! The C switch statement has been translated by emscripten into a linear sequence of ifs. No wonder the run time grows linearly with the number of cases.
This is a serious bug!
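To make the cost concrete, here is a small illustration of the difference between a switch dispatch, which a compiler can lower to an O(1) jump table, and the linear if-chain shape found in the generated code, which is walked test by test. It is written in C with made-up cases purely for illustration; the emscripten output in question is JavaScript.

/* Switch dispatch vs. linear if-chain dispatch (illustrative cases). */
#include <stdio.h>

static int dispatch_switch(int pc) {
    switch (pc) {              /* can be lowered to a jump table: O(1) */
    case 0: return 10;
    case 1: return 20;
    case 2: return 30;
    default: return -1;
    }
}

static int dispatch_if_chain(int pc) {
    if (pc == 0) return 10;    /* every dispatch re-tests from the top: O(n) */
    if (pc == 1) return 20;
    if (pc == 2) return 30;
    return -1;
}

int main(void) {
    printf("%d %d\n", dispatch_switch(2), dispatch_if_chain(2));  /* 30 30 */
    return 0;
}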
Comment 37•12 years ago
(In reply to Marc Feeley from comment #36)
> Oh my god! The C switch statement has been translated by emscripten into a
> linear sequence of ifs. No wonder the run time grows linearly with the
> number of cases.
>
> This i a serious bug!
Also present in bug 864587. Looking for "_longjmp($arraydecay_" (without quotes) on the generated source code (https://bug864587.bugzilla.mozilla.org/attachment.cgi?id=747235), there are >2,000 if conditions in a sequence, which look like a switch.
Comment 38•12 years ago (Reporter)
I have filed bug report #1278 on the emscripten repo:
https://github.com/kripken/emscripten/issues/1278
Comment 39•12 years ago
(In reply to Benjamin Bouvier [:bbouvier] from comment #37)
> (In reply to Marc Feeley from comment #36)
> > Oh my god! The C switch statement has been translated by emscripten into a
> > linear sequence of ifs. No wonder the run time grows linearly with the
> > number of cases.
> >
> > This i a serious bug!
>
> Also present in bug 864587. Looking for "_longjmp($arraydecay_" (without
> quotes) on the generated source code
> (https://bug864587.bugzilla.mozilla.org/attachment.cgi?id=747235), there are
> >2,000 if conditions in a sequence, which look like a switch.
Might it be possible for the compiler to optimize this particular pattern by detecting a constant assignment to the 'label', tracing from there back to the start of the loop and through any tables or conditional trees that depend only on this 'label', and, if the target is found, branching directly to it?
The label assignment could also be optimized away if it can be proven that the target does not depend on it.
The large 'move group' operations, needlessly shuffling around the variables, also seem to be a compiler issue, and perhaps a reduced example could help analysis.
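Here is a minimal sketch of the pattern this suggested rewrite targets, with illustrative labels and block bodies. The constant store to 'label' followed by a trip back through the dispatch switch could, in principle, be replaced by a direct branch to case 7, and the store dropped if nothing else reads it.

/* Constant-label dispatch pattern (illustrative labels and block bodies). */
#include <stdio.h>

static int run(void) {
    int acc = 0;
    int label = 3;                 /* dispatch variable */
    for (;;) {
        switch (label) {
        case 3:
            acc += 1;
            label = 7;             /* constant assignment to 'label' */
            break;                 /* target only reached via the dispatch */
        case 7:
            return acc + 10;
        default:
            return -1;
        }
    }
}

int main(void) {
    printf("%d\n", run());         /* prints 11 */
    return 0;
}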
Comment 40•12 years ago (Reporter)
(In reply to Douglas Crosher [:dougc] from comment #39)
> (In reply to Benjamin Bouvier [:bbouvier] from comment #37)
> > (In reply to Marc Feeley from comment #36)
> > > Oh my god! The C switch statement has been translated by emscripten into a
> > > linear sequence of ifs. No wonder the run time grows linearly with the
> > > number of cases.
> > >
> > > This i a serious bug!
> >
> > Also present in bug 864587. Looking for "_longjmp($arraydecay_" (without
> > quotes) on the generated source code
> > (https://bug864587.bugzilla.mozilla.org/attachment.cgi?id=747235), there are
> > >2,000 if conditions in a sequence, which look like a switch.
>
> Might it be possible for the compiler to optimize this particular pattern by
> detecting a constant assignment to the 'label' and tracing back to the start
> of the loop and through any tables or conditional trees that depend only on
> this 'label' and if found then branching directly to the target.
>
> The label assignment could also be optimized away if it can be proven that
> the target does not depend on it.
These optimizations could be done by either emscripten or the JS compiler. In fact it would be good to do it at both levels: in emscripten, for the C-to-JavaScript path on any JS VM, and in the JS compiler, for the benefit of other compilers that target JS.
> The large 'move group' operations, needlessly shuffling around the
> variables, also seems to be a compiler issue, and perhaps a reduced example
> could help analysis.
I have also seen some strange code patterns being output by emscripten. One case I remember is speculatively lifting side-effect-free simple expressions out of the branches of a large switch in a loop. This is a bad idea because only a few of those expressions may actually be executed, and when the expressions are simple it is often faster to compute them on the fly than to read them from a variable in memory.
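A small sketch, with hypothetical expressions, of the hoisting pattern described above: the per-case computations have been lifted above the switch, so each dispatch evaluates all of them even though at most one is used.

/* Speculative hoisting of per-case expressions (hypothetical expressions). */
#include <stdio.h>

static int run(int pc, int x) {
    int t0 = x + 1;                /* hoisted: only needed by case 0 */
    int t1 = x * 2;                /* hoisted: only needed by case 1 */
    int t2 = x - 3;                /* hoisted: only needed by case 2 */
    switch (pc) {
    case 0: return t0;
    case 1: return t1;
    case 2: return t2;
    default: return x;
    }
}

int main(void) {
    printf("%d\n", run(1, 21));    /* prints 42 */
    return 0;
}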
Comment 41•11 years ago
Looks like https://github.com/kripken/emscripten/issues/1278 has been resolved. Is this issue fixed?
Updated•11 years ago
Assignee: general → nobody
Updated•3 years ago
Severity: normal → S3
Updated•2 years ago
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INCOMPLETE