Bug 687127 - Large compiled code thrown out and re-compiled constantly
Product: Core
Classification: Components
Component: JavaScript Engine
Version: unspecified
Platform: All
OS: All
Importance: -- normal
Target Milestone: mozilla13
Assigned To: Brian Hackett (:bhackett)
Depends on: 689284 690446 698100 706914
Blocks: infer-perf-regress 644244
Reported: 2011-09-16 11:44 PDT by Alon Zakai (:azakai)
Modified: 2012-02-13 09:09 PST

Attachments:

sqlite port to js (2.37 MB, application/octet-stream)
2011-09-16 11:44 PDT, Alon Zakai (:azakai)

sqlite benchmark that also works in v8/node and is readable (550.33 KB, application/octet-stream)
2012-02-07 13:33 PST, Alon Zakai (:azakai)

don't update gcMallocBytes during compilation (3.85 KB, patch)
2012-02-08 11:16 PST, Brian Hackett (:bhackett)

use OffTheBooks allocation (9.63 KB, patch)
2012-02-12 12:48 PST, Brian Hackett (:bhackett) (anygregor: review+)

Description Alon Zakai (:azakai) 2011-09-16 11:44:20 PDT
Created attachment 560597 [details]
sqlite port to js

This is a followup to bug 686919. Running closure compiler on that code makes it avoid the correctness issue, but the slowdown remains.

The attachment contains a port of SQLite to JS. I think it would be extremely useful to be able to use SQLite on the web with good performance. The current numbers I get are:

native   :  0.75
-m -j -p : 25
-m -n    : 52

-m -j -p is over 30X slower than native code, while -m -n is twice as slow again. However, this is all code compiled from C, so it should play very nicely with type inference. I was hoping to get results similar to those on benchmark code I have compiled in the past, which is just 3X-5X slower than native code.

The attachment contains several files:

  sqlite_0_1_cc.js - compiled code + closure compiler
  sqlite_0_1_cc_pretty.js - compiled code + closure compiler using pretty printing. This is for readability; sadly, though, it won't run: it hits "too much recursion", apparently during parsing (should I file a separate bug?)
  src.c - complete C source code
  src.c.o.js - compiled code before closure compiler
  vars - map of variables in src.c.o.js to those in the closure compiled code

Any help with getting this to run fast would be great. Aside from figuring out the specific TI issue here, advice about changes to the compiled code would be much appreciated as well.
Comment 1 Jan de Mooij [:jandem] 2011-09-26 14:11:54 PDT
First problem here is the size of _sqlite3VdbeExec. This function is analyzed about 10 times (once after each GC), and at 2.5 seconds per analysis that is 10 * 2.5 = 25 seconds. The function is recompiled about 50 times. Each recompilation is faster (0.7 seconds) but it probably adds up to another 25 seconds. Filed bug 689284.
Comment 2 Brian Hackett (:bhackett) 2011-10-28 11:03:30 PDT
Testing on the JM branch + bug 697861 patch, this behaves somewhat differently.  I get:

gcc -O3: .4s
gcc -O0: .75s
-m: 11.9s
-m -n: 13.5s
pre-TI -m: 23.6s (??)
d8: 46.5s

The GCs being triggered are all due to too much malloc'ing.  It seems like this is due to compilation activity, though I need to confirm.  If I remove GC malloc triggers, the time spent analyzing with -m -n (though not compiling) is significantly reduced:

(times in ms)

gcMalloc enabled:

Time analyzeLifetimes: 56.79
Time analyzeSSA: 257.55
Time analyzeTypes: 358.10
Time performCompilation: 5243.62
Time analyzeBytecode: 75.77

gcMalloc disabled:

Time analyzeLifetimes: 19.76
Time analyzeSSA: 97.28
Time analyzeTypes: 153.18
Time performCompilation: 5295.58
Time analyzeBytecode: 33.16
Comment 3 Brian Hackett (:bhackett) 2011-12-01 17:07:02 PST
The patch in bug 706914 does a good job with this testcase.  For some reason the baseline compilation time crept up to 8s from 5.3s sometime over the last month.  But with a chunk limit of 1000 (rough number of bytes of bytecode per compilation unit), compilation time is reduced to 1.1s and the overall execution time from 16.9s to 10.7s.  Analysis time is .3s.  This is still 14x slower than unoptimized C, but at least the patch removes compilation and analysis as the performance bottleneck --- the remaining issues are in the efficiency of the generated jitcode and of the translated JS.

Time analyzeLifetimes: 18.23
Time analyzeSSA: 94.47
Time analyzeTypes: 131.01
Time performCompilation: 1067.98
Time analyzeBytecode: 39.21
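[Editorial note: a minimal sketch of the chunking idea described in this comment --- splitting a huge function's bytecode into compilation units of roughly `chunkLimit` bytes so a single enormous function like _sqlite3VdbeExec is not compiled monolithically. The function name and interface here are illustrative only, not the bug 706914 patch's actual code.]

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split a bytecode stream of `bytecodeLength` bytes into [start, end)
// chunks of at most `chunkLimit` bytes each. A real implementation would
// also have to respect instruction and control-flow boundaries; this
// sketch only shows the size-based partitioning.
std::vector<std::pair<size_t, size_t>>
chunkBytecode(size_t bytecodeLength, size_t chunkLimit)
{
    std::vector<std::pair<size_t, size_t>> chunks;
    for (size_t start = 0; start < bytecodeLength; start += chunkLimit) {
        size_t end = std::min(start + chunkLimit, bytecodeLength);
        chunks.push_back({start, end});
    }
    return chunks;
}
```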
Comment 4 Alon Zakai (:azakai) 2012-02-07 13:33:53 PST
Created attachment 595153 [details]
sqlite benchmark that also works in v8/node and is readable

Attached is a benchmark that also works in node/v8 (the previous attachment does not). It is also more readable (not closure-compiled). Results are:

node       9s
js -m      9s
js -m -n 250s

TI is 28 times slower than node and than SpiderMonkey without TI. This is on mozilla-central trunk and node 0.6.5.

I also get similar results on sqlite_0_1_cc_pretty.js from the old archive attachment:

js -m      9s
js -m -n 142s

TI is 15x slower on that older benchmark.
Comment 5 Brian Hackett (:bhackett) 2012-02-08 11:16:36 PST
Created attachment 595464 [details] [diff] [review]
don't update gcMallocBytes during compilation

The problem here seems to be that the footprint of malloc'ed bytes used by the compiled code is greater than the malloc bytes GC trigger.  All the compiled code is discarded on GC, so we end up in a loop where lots of code gets compiled, which triggers a GC that throws everything away and starts over (see also: Sisyphus).

This may or may not have been induced by bug 679026, but it's a preexisting issue for sure.  Since data malloc'ed during compilation is not held by any GC thing (being either stack allocated or discarded on GC), it shouldn't count towards the GC malloc trigger.
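[Editorial note: a simplified model of the feedback loop and the fix this patch makes. All names here (Runtime, noteMalloc, inCompilation, and the trigger values) are invented for illustration; this is not SpiderMonkey's actual gcMallocBytes code.]

```cpp
#include <cstddef>

// Toy model of a GC malloc trigger. The engine counts malloc'ed bytes and
// fires a GC once the counter crosses a threshold; the bug was that bytes
// malloc'ed *during compilation* also fed the counter, so compiling enough
// code guaranteed a GC, which discarded all compiled code and restarted
// the cycle.
struct Runtime {
    size_t gcMallocBytes = 0;   // bytes malloc'ed since the last GC
    size_t gcMallocTrigger;     // GC fires when the counter reaches this
    bool inCompilation = false; // set while the JIT is compiling
    int gcCount = 0;

    explicit Runtime(size_t trigger) : gcMallocTrigger(trigger) {}

    void noteMalloc(size_t bytes) {
        // The fix: compilation-time allocations are either stack-held or
        // discarded at the next GC anyway, so they should not push the
        // counter toward the trigger.
        if (inCompilation)
            return;
        gcMallocBytes += bytes;
        if (gcMallocBytes >= gcMallocTrigger)
            gc();
    }

    void gc() {
        ++gcCount;
        gcMallocBytes = 0;
        // In the real engine this also discards all compiled code, which
        // is what made the compile/discard loop so expensive.
    }
};
```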
Comment 6 Gregor Wagner [:gwagner] 2012-02-08 13:49:43 PST
Hm, adding another mechanism to allocate memory without accounting for it? What's the problem with OffTheBooks allocation?
Comment 7 Alon Zakai (:azakai) 2012-02-12 12:04:03 PST
I am seeing this on compiled Box2D as well. With the patch here, it becomes 5x faster (and 2x faster than v8).
Comment 8 Gregor Wagner [:gwagner] 2012-02-12 12:22:23 PST
(In reply to Gregor Wagner [:gwagner] from comment #6)
> Hm, adding another mechanism to allocate memory without accounting for it?
> What's the problem with OffTheBooks allocation?

Brian, any comments why you didn't use OffTheBooks allocation?
Comment 9 Brian Hackett (:bhackett) 2012-02-12 12:32:11 PST
Sorry, missed your earlier comment.  I'll put together another patch which uses OffTheBooks allocation instead --- the earlier one is simpler and more of a temporary kludge in case we make OffTheBooks allocation the default (which I think it should be).
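[Editorial note: js::OffTheBooks is a real SpiderMonkey allocation class, but the sketch below is an invented illustration of the concept being discussed: keeping two allocation paths, one that feeds the GC's malloc accounting and one that deliberately bypasses it for transient compiler-internal memory.]

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical per-runtime accounting state.
struct Accounting {
    size_t gcMallocBytes = 0;
};

// Accounted path: normal engine allocations that keep GC-visible data
// alive, so they count toward the next malloc-triggered GC.
void *accountedMalloc(Accounting &acct, size_t bytes)
{
    acct.gcMallocBytes += bytes;
    return std::malloc(bytes);
}

// Off-the-books path: transient allocations (e.g. compiler scratch data)
// that are freed or discarded wholesale before the GC could care about
// them, so they bypass the accounting entirely.
void *offTheBooksMalloc(size_t bytes)
{
    return std::malloc(bytes);
}
```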
Comment 10 Brian Hackett (:bhackett) 2012-02-12 12:48:35 PST
Created attachment 596509 [details] [diff] [review]
use OffTheBooks allocation
Comment 11 Gregor Wagner [:gwagner] 2012-02-12 13:14:19 PST
Comment on attachment 596509 [details] [diff] [review]
use OffTheBooks allocation

I think you also have to adjust the OffTheBooks counter in the Makefile.
Comment 12 Brian Hackett (:bhackett) 2012-02-12 19:29:01 PST
Comment 13 Marco Bonardo [::mak] 2012-02-13 09:09:45 PST