Created attachment 560597 [details]
sqlite port to js
This is a followup to bug 686919. Running the Closure Compiler on that code avoids the correctness issue, but the slowdown remains.
The attachment contains a port of SQLite to JS. I think it would be extremely useful to be able to use SQLite on the web with good performance. The current numbers I get are:
native : 0.75
-m -j -p : 25
-m -n : 52
-m -j -p is over 30X slower than native code, while -m -n is twice as slow again. However, this is all code compiled from C, so it should play very nicely with type inference. I was hoping for results similar to benchmark code I have compiled in the past, which runs just 3X-5X slower than native code.
The attachment contains several files:
sqlite_0_1_cc.js - compiled code + closure compiler
sqlite_0_1_cc_pretty.js - compiled code + closure compiler using pretty printing. This is for readability. Sadly, though, it won't run: it hits "too much recursion", apparently during parsing (should I file a separate bug?)
src.c - complete C source code
src.c.o.js - compiled code before closure compiler
vars - map of variables in src.c.o.js to those in the closure compiled code
Any help with getting this to run fast would be great. Aside from figuring out the specific TI issue here, advice about changes to the compiled code would be very appreciated as well.
First problem here is the size of _sqlite3VdbeExec. This function is analyzed about 10 times (once after each GC), and this takes 10 * 2.5 = 25 seconds. The function is also recompiled about 50 times; each recompilation is faster (0.7 seconds), but it probably adds up to another 25 seconds. Filed bug 689284.
Testing on the JM branch + bug 697861 patch, this behaves somewhat differently. I get:
gcc -O3: .4s
gcc -O0: .75s
-m -n: 13.5s
pre-TI -m: 23.6s (??)
The GCs being triggered are all due to too much malloc'ing. It seems like this is due to compilation activity, though I need to confirm. If I remove GC malloc triggers, the time spent analyzing with -m -n (though not compiling) is significantly reduced:
(times in ms)
Time analyzeLifetimes: 56.79
Time analyzeSSA: 257.55
Time analyzeTypes: 358.10
Time performCompilation: 5243.62
Time analyzeBytecode: 75.77
Time analyzeLifetimes: 19.76
Time analyzeSSA: 97.28
Time analyzeTypes: 153.18
Time performCompilation: 5295.58
Time analyzeBytecode: 33.16
The patch in bug 706914 does a good job with this testcase. For some reason the baseline compilation time crept up from 5.3s to 8s sometime over the last month. But with a chunk limit of 1000 (roughly the number of bytes of bytecode per compilation unit), compilation time is reduced to 1.1s and the overall execution time from 16.9s to 10.7s. Analysis time is .3s. This is still 14x slower than unoptimized C, but at least the patch removes compilation and analysis as the performance bottleneck; the remaining issues are in the efficiency of the generated jitcode and of the translated JS.
Time analyzeLifetimes: 18.23
Time analyzeSSA: 94.47
Time analyzeTypes: 131.01
Time performCompilation: 1067.98
Time analyzeBytecode: 39.21
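For what it's worth, the chunking idea above amounts to greedily packing bytecode ops into units until the byte limit is hit. Here's a minimal sketch (names and the `{ size }` op representation are made up for illustration; this is not the actual patch code), just to show why a huge function like _sqlite3VdbeExec stops being a single all-or-nothing compilation unit:

```javascript
// Hypothetical sketch: split a function's bytecode into compilation
// units of at most `limit` bytes. `ops` is assumed to be an array of
// { size } records, one per bytecode op.
function chunkBytecode(ops, limit) {
  const chunks = [];
  let current = [];
  let bytes = 0;
  for (const op of ops) {
    // Start a new chunk when adding this op would exceed the limit
    // (an oversized single op still gets its own chunk).
    if (bytes + op.size > limit && current.length > 0) {
      chunks.push(current);
      current = [];
      bytes = 0;
    }
    current.push(op);
    bytes += op.size;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

With a 1000-byte limit, discarding and recompiling code after a GC then only costs time proportional to the chunks actually re-executed, not the whole 25,000-line function.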
Created attachment 595153 [details]
sqlite benchmark that also works in v8/node and is readable
Attached is a benchmark that also works in node/v8 (the previous attachment does not). It is also more readable (not closure-compiled). Results are:
js -m 9s
js -m -n 250s
TI is 28 times slower than node and spidermonkey without TI. This is on mozilla-central trunk and node 0.6.5.
I also get similar results with sqlite_0_1_cc_pretty.js from the old archive attachment:
js -m 9s
js -m -n 142s
TI is 15x slower on that older benchmark.
Created attachment 595464 [details] [diff] [review]
don't update gcMallocBytes during compilation
The problem here seems to be that the footprint of malloc'ed bytes used by the compiled code is greater than the GC's malloc-bytes trigger. All compiled code is discarded on GC, so we end up in a loop where lots of code gets compiled, which triggers a GC, which throws everything away and starts it all over (see also: Sisyphus).
This may or may not have been induced by bug 679026, but it's a preexisting issue for sure. Since data malloc'ed during compilation is not held by any GC thing (being either stack allocated or discarded on GC), it shouldn't count towards the GC malloc trigger.
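To make the feedback loop concrete, here is a toy model (all names and numbers are invented; this is not engine code). Each compiled chunk adds `codeBytes` to a running malloc counter; when the counter passes `trigger`, a "GC" fires and discards all jitcode, so the next pass has to recompile everything:

```javascript
// Toy simulation of the described loop. Returns how many compilations
// and GCs a fixed workload of `chunks` code chunks causes over up to
// 10 execution passes.
function runWorkload(chunks, codeBytes, trigger, countJitcode) {
  let mallocBytes = 0;
  const compiled = new Set();
  let compilations = 0;
  let gcs = 0;
  for (let pass = 0; pass < 10; pass++) {
    let recompiledThisPass = false;
    for (let i = 0; i < chunks; i++) {
      if (!compiled.has(i)) {
        compiled.add(i);
        compilations++;
        recompiledThisPass = true;
        if (countJitcode) mallocBytes += codeBytes;
        if (mallocBytes > trigger) { // GC: discard all jitcode
          gcs++;
          mallocBytes = 0;
          compiled.clear();
        }
      }
    }
    if (!recompiledThisPass) break; // steady state: everything stays compiled
  }
  return { compilations, gcs };
}
```

When jitcode counts toward the trigger (and its footprint exceeds the trigger), the model never reaches a steady state: every pass keeps compiling and collecting. When it doesn't count, everything is compiled exactly once and no GC fires, which is the behavior the patch is after.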
Hm, adding another mechanism to allocate memory without accounting for it? What's the problem with OffTheBooks allocation?
I am seeing this on compiled Box2D as well. With the patch here, it becomes 5x faster (and 2x faster than v8).
(In reply to Gregor Wagner [:gwagner] from comment #6)
> Hm adding another mechanism to allocate memory without accounting for it?
> What's the problem with OffTheBooks allocation?
Brian, any comments why you didn't use OffTheBooks allocation?
Sorry, missed your earlier comment. I'll put together another patch which uses OffTheBooks allocation instead; the earlier one is simpler, and more of a temporary kludge in case we make OffTheBooks allocation the default (which I think it should be).
Created attachment 596509 [details] [diff] [review]
use OffTheBooks allocation
Comment on attachment 596509 [details] [diff] [review]
use OffTheBooks allocation
I think you also have to adjust the OffTheBooks counter in the Makefile.