Bug 539123
Opened 15 years ago
Closed 13 years ago
Measure cost of trace/interpreter value and tag conversions
Categories
(Core :: JavaScript Engine, defect)
RESOLVED
WONTFIX
People
(Reporter: dmandelin, Assigned: dvander)
Attachments
(1 file)
10.71 KB, patch
We are trying to decide for JaegerMonkey where to store type tags for values stored on the stack (for interpreted & Jaeger modes). Major alternatives are:
1. Type tags are stored in a separate byte array that is logically parallel
to the value stack array.
Luke pointed out that the key advantage and disadvantage are:
+ The type tag array can be just like a trace typemap, making transitions
on and off trace easier.
- For fast natives, we need to either form old-style jsvals before the
call, making them slower (but maybe by only 2-3 instructions per arg);
or modify the fast native protocol to pass in the type tag stack.
2. NaN-boxing.
3. Type tags are stored next to values in the standard stack array. There are
a few variants of this idea; the simplest is to have each stack slot hold
a word tag followed by a word value.
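For concreteness, a rough sketch of the three layouts (the struct names, field widths, and SLOTS constant are illustrative, not SpiderMonkey's):

#include <stdint.h>

enum { SLOTS = 256 };

// 1. Split stacks: type tags live in a byte array logically parallel
//    to the value stack, just like a trace typemap.
struct SplitStacks {
    uint64_t values[SLOTS];  // raw payloads: double bits, int32, pointer
    uint8_t  tags[SLOTS];    // one tag per slot
};

// 2. NaN-boxing: tag and payload share one 64-bit word, using bit
//    patterns that no canonical double occupies.
typedef uint64_t NanBoxedValue;

// 3. Inline tags (simplest variant): each stack slot holds a word tag
//    followed by a word value.
struct TaggedSlot {
    uint32_t tag;
    uint32_t payload;
};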
Currently 1 and 2 seem to be the strongest contenders. In order to get a better handle on this, we need to measure how much we pay handling value and tag conversions when going on and off trace.
As a first cut, I think we want to measure how much time we spend importing values into the trace (untagging and copying them from the interpreter) and converting values back when we leave trace (applying the type map to tag the values). We should try to isolate this as much as possible (keep in mind rdtsc() takes about 90-100 cycles to run) and measure:
- number of values imported
- time spent importing
- number of values exported
- time spent exporting
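A sketch of what that instrumentation could look like (the jsval stand-in, Unbox(), and the counters are hypothetical):

#include <stddef.h>
#include <stdint.h>

typedef uint64_t jsval;        // stand-in for the real jsval
extern double Unbox(jsval v);  // hypothetical: strip the tag, return the payload

// rdtsc() itself costs ~90-100 cycles, so bracket whole import/export
// loops, never individual values.
static inline uint64_t rdtsc() {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static uint64_t importCycles, importCount;  // exports are measured the same way

void ImportValues(double *native, const jsval *vp, size_t n) {
    uint64_t t0 = rdtsc();
    for (size_t i = 0; i < n; i++)
        native[i] = Unbox(vp[i]);           // untag and copy onto the trace stack
    importCycles += rdtsc() - t0;
    importCount  += n;
}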
Reporter
Comment 1 • 15 years ago
Additional comment on choices 2 and 3 (nan-boxing and inline type tags without bit stealing):
Choice 2. In nan-boxing, we use 8 bytes to store tagged values. Some of the don't-care nan bits (bit 63 and bits 0-51) are used for tags. Key facts about nan-boxing:
- tagged values are 8 bytes
- checking the type may require masking, depending on the scheme
- getting the value doesn't require much masking or shifting, but it can
require a little bit in some cases. (E.g., the jsc method requires an
integer subtraction to get a |double| value.)
- doubles don't have to go on the heap.
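To make the bit layout concrete, a sketch of one possible 64-bit encoding (the tag constants are invented, not SpiderMonkey's or JSC's):

#include <stdint.h>
#include <string.h>

// IEEE-754: exponent bits 52-62 all set plus a nonzero mantissa is a
// NaN, so tag patterns in that space are unreachable by real doubles.
static const uint64_t TAG_MASK  = 0xFFFF000000000000ULL;
static const uint64_t TAG_INT32 = 0xFFF1000000000000ULL;  // hypothetical

static inline uint64_t boxInt32(int32_t i) {
    return TAG_INT32 | (uint32_t)i;      // no heap allocation
}
static inline bool isInt32(uint64_t v) {
    return (v & TAG_MASK) == TAG_INT32;  // one mask plus one compare
}
static inline uint64_t boxDouble(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);      // real NaNs must be canonicalized first
    return bits;
}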
With inline tags, there are a few major options.
Choice 3a. One option, which I believe is used in jsc, is to use one word for the type and another word for a 32-bit value, so values are 8 bytes. Key facts:
- values are 8 bytes
- checking the type doesn't require masking
- getting the value doesn't require shifting
- doubles still must go on the heap.
Choice 3b. Another option, used in Lua, is to use one word for the type and 2 words for the value, so doubles can be stored directly. Key facts:
- values are 12 bytes
- checking the type doesn't require masking
- getting the value doesn't require shifting
- doubles don't have to go on the heap.
Consumer Reports-style comparison chart:
                 value size  no-mask?  no-shift?  no-heap doubles?
8-byte nan            +          ?         =              +
8-byte inline         +          +         +              -
12-byte inline        -          +         +              +
'+' means good, '-' means bad. In detail, for 'value size', '+' means smaller (8 bytes) and '-' means larger (12 bytes). For the rest, '+' means yes to the predicate at the top of the column, '=' means sometimes yes, '?' means it depends on the scheme, and '-' means no.
The results in bug 539097 suggest that any masking and shifting required for nan-boxing have minimal cost. That in turn suggests nan-boxing is the best of these 3. But if not, then it would depend on the tradeoff between heap doubles and larger values.
Assignee
Comment 2 • 15 years ago
The first thing I was interested in seeing was how much this actually matters for tracing, because ideally I would like to see fast transitions between baseline/tracer. Attached is an instrumentation patch.
A really bad case: v8-early-boyer, lots of trace entries/exits because an outer loop fails to trace.
100% of trace time (each percentage is relative to its enclosing entry)
  69% in ExecuteTree
    57% in LeaveTree
      39% in building global exit typemap
      17% in FlushNativeStackFrame
      17% in FlushNativeGlobalFrame
      .5% in SynthesizeFrame
    16% in ExecuteTrace
    11% in BuildNativeGlobalFrame
     7% in BuildNativeStackFrame
  23% in FindVMCompatiblePeer or findNestedCompatiblePeer
   3% in fragment lookup
A more average case where we could do better - 3d-cube:
100% of trace time
  90% in ExecuteTree
    68% in ExecuteTrace
    22% in LeaveTree
      60% in FlushNativeStackFrame
      12% in building global exit typemap
       9% in FlushNativeGlobalFrame
       0% in SynthesizeFrame
     4% in BuildNativeStackFrame
     2% in BuildNativeGlobalFrame
   7% in FindVMCompatiblePeer or findNestedCompatiblePeer
   1% in fragment lookup
Assignee
Comment 3 • 15 years ago
The high cost of finding compatible peers is interesting: right now we walk the stack frame and decode jsvals for every peer. I tried building this typemap beforehand, but it made things 60-100% slower, probably because it introduced an extra loop (and we can't memcmp() because a TT_INT32 can go into a TT_DOUBLE slot). I will try to drill down further.
Reporter
Comment 4 • 15 years ago
So it looks like for v8-early-boyer, the build/flush time combined is about 25% of trace time. For 3d-cube, it's about 20%.
The main question now is how much it would speed those parts up if we use the byte array tag stack in the interpreter. My (uneducated) guess would be that extracting and translating type tags is a fairly low fraction of the build/flush time, maybe 10% or less, in which case there is not much improvement to be had here. But if it's more like 50%, then this could be a nice speedup.
Did you get those counts of how many jsvals we translate? If we put that together with an estimate of the number of instructions to deal with tags per value, we should have a better guess at the speedup potential.
Assignee
Comment 5 • 15 years ago
I will come back to the question in comment #4 in a bit -- in the interim I wrote a tiny tool we can use to get a better idea of the benefits + costs here.
http://hg.mozilla.org/users/danderson_mozilla.com/acmescript
You can write simple scripts that return a value (no function calls, numbers/bools only so far). It generates stack-based bytecode and has an interpreter loop. To mimic SpiderMonkey it can heap allocate doubles and collect them periodically.
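For a sense of the shape of that loop, a minimal sketch (the opcodes, Value, and Add() are invented for illustration; acmescript's actual code will differ):

#include <stdint.h>

struct Value { uint64_t bits; };     // stand-in for the boxing scheme under test
extern Value Add(Value a, Value b);  // may heap-allocate a double result

enum Op : uint8_t { OP_GETLOCAL, OP_ADD, OP_SETLOCAL, OP_RETURN };

// Stack-based dispatch: note the getlocal/getlocal/add/setlocal pattern
// that a register-based VM (like Lua's) avoids.
Value Interpret(const uint8_t *pc, Value *locals, Value *sp) {
    for (;;) {
        switch (*pc++) {
          case OP_GETLOCAL: *sp++ = locals[*pc++];             break;
          case OP_ADD:      --sp; sp[-1] = Add(sp[-1], sp[0]); break;
          case OP_SETLOCAL: locals[*pc++] = *--sp;             break;
          case OP_RETURN:   return *--sp;
        }
    }
}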
So far I've implemented SpiderMonkey-style tags (heap allocated numbers) and NaN boxing. The sole "benchmark" (gcd.acme) reveals:
Tagged boxes, x64: 6.9s
Tagged boxes, x32: 3.7s
NaN boxes, x64: 2.9s
NaN boxes, x32: 5.8s
There's still a lot more to experiment with (different benchmark behavior, more representations, decoding the stack for transitions to trace and to the API), but it should be easy to play around with.
Reporter
Comment 6 • 15 years ago
Interesting. I'd like to see what happens to those numbers with a JaegerMonkey-style JIT.
Assignee
Comment 7 • 15 years ago
JSC's "64-bit vals on 32-bit" support is actually a little different: it cleverly avoids heap-allocating numbers. Testing all three variants of wide boxes is good though:
A: [32-bit tag, 32-bit payload] with the ability to encode doubles
B: [32-bit tag, 32-bit payload] with heap allocated doubles
C: [32-bit tag, 64-bit payload] (Lua style)
The first two options only work well on 32-bit CPUs.
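Roughly, the layouts (field names are mine):

#include <stdint.h>

// A and B share one shape: one word of tag, one word of payload. In A
// the tag space also encodes doubles (JSC-style); in B a double payload
// is a pointer to a heap-allocated double.
struct WideBoxAB {
    uint32_t tag;
    uint32_t payload;
};

// C: one word of tag, two words of payload, so doubles fit inline
// (Lua-style). 12 bytes if packed, 16 with natural double alignment.
struct WideBoxC {
    union { double d; int32_t i; void *p; } payload;
    uint32_t tag;
};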
Wide boxes "A", x32: 3.2s
Wide boxes "B", x32: 4.0s
Wide boxes "C", x32: 3.2s
Wide boxes "C", x64: 3.6s
This is not a totally fair test for the really wide boxes because there will be excess copying going on. Lua is register based so it doesn't need the expensive "getlocal; getlocal; add; setlocal" patterns. And the JIT could make these go away anyway, so indeed it would be nice to test that.
Assignee
Comment 8 • 15 years ago
I've implemented idea #1 in a separate interpreter, because abstracting the stack manipulation here was skewing benchmark results. Now that I'm done I think it could be done better with macros, but anyway...
Here are two sets of tests. The first set uses the single-stack interpreter, and tests all value representations:
* BOX = What SpiderMonkey does now
* NAN = 64-bit NaN Boxing
* 32N = 32-bit NaN Boxing
* 32T = 32-bit value, 32-bit tag
* 64T = 64-bit value, 32-bit tag
All times are in seconds.
      |  BOX |  NAN |  32N |  32T |  64T
------+------+------+------+------+------
  x64 | 6.77 | 2.88 |  N/A |  N/A | 3.48
  x86 | 3.64 | 5.50 | 3.06 | 3.95 | 3.38
The reason 64-bit NaN boxing is slow on x86 is probably that the code does bit arithmetic on 64-bit ints. I could probably write a cleverer version that avoids checking all the bits, but I don't think it would beat 32-bit NaN boxing, where tag checking is still cheaper.
Now for the "split stack" idea, where there is one stack for types and one for values - the rest of the engine still needs boxed values (as would, in theory, JaegerMonkey). In this benchmark tool, all constants (including 0 and 1) are pooled in the script object as boxed values, so loading them onto the stack is no longer a plain copy; the type has to be decoded.
Taking that into account, I tested each boxing variant against the dual stack interpreter, for both platforms, twice: once with 32-bit tags and once with 8-bit tags.
32-bit tag stack:
      |  BOX |  NAN |  32N |  32T |  64T
------+------+------+------+------+------
  x64 | 2.53 | 2.50 |  N/A |  N/A | 3.57
  x86 | 3.07 | 4.37 | 4.43 | 3.34 | 3.74
8-bit tag stack:
      |  BOX |  NAN |  32N |  32T |  64T
------+------+------+------+------+------
  x64 | 2.52 | 2.50 |  N/A |  N/A | 2.51
  x86 | 3.09 | 4.95 | 3.11 | 3.15 | 3.14
Comment 9 • 15 years ago
This is excellent! I'm excited to see the measurements simulating entering/leaving trace.
One question about 64T: it seems like every other payload, if accessed as an 8-byte value, would be misaligned and possibly penalized. Out of curiosity (and for completeness), could you test 64T with an extra 4 bytes of padding?
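In other words, something like this (hypothetical structs, GCC-style packing):

#include <stdint.h>

// 64T as tested: 12-byte slots, so in an array every other 8-byte
// payload lands on a 4-byte boundary.
struct Slot64T {
    uint32_t tag;
    uint64_t payload;
} __attribute__((packed));

// The padded variant: 4 unused bytes keep every payload 8-byte
// aligned, at the cost of 16 bytes per slot.
struct Slot64TPadded {
    uint32_t tag;
    uint32_t padding;
    uint64_t payload;
};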
Reporter
Comment 10 • 15 years ago
(In reply to comment #8)
Agreed, excellent! I have some dumb questions though:
- What is 32-bit NaN boxing?
- For the split stack results, what exactly are the representations? Does BOX mean that you store SM-style boxed jsvals in the value stack and then there is also a separate tag stack? Or does it mean the value stack is raw values and the tag stack holds SM-style tags? Or something else?
Assignee
Comment 11 • 15 years ago
32-bit NaN boxing:
union Value {
    double d;            // a real double, stored directly
    struct {
        int32 payload;   // low word: the 32-bit payload
        int32 tag;       // high word: a type tag in the NaN space
    };                   // anonymous struct; layout assumes little-endian
};
Now you can use a bunch of values in the NaN range for tags (WebKit uses 0xFFFFFFF7 ... 0xFFFFFFFF or so). Anything in that range means there is a 32-bit payload.
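A sketch of the accessors this layout buys (the tag constants are illustrative, loosely after WebKit's range; int32/uint32 are SpiderMonkey-style typedefs, and the union name Value is mine):

// Anything at or above TAG_FIRST in the high word is a tagged payload;
// anything below is the high word of a genuine double.
static const uint32 TAG_FIRST = 0xFFFFFFF7;  // hypothetical
static const uint32 TAG_INT32 = 0xFFFFFFF9;  // hypothetical

static inline bool isDouble(Value v) { return (uint32)v.tag <  TAG_FIRST; }
static inline bool isInt32(Value v)  { return (uint32)v.tag == TAG_INT32; }
static inline int32 asInt32(Value v) { return v.payload; }   // no shift, no mask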
The split stack has one stack for raw values and another for SM-style tags. The "BOX" refers to how everything outside the interpreter wants to see values - in that case, SM-style boxes.
I'm going to add some syntax to test two more things:
* Simulate API calls, to see how expensive it is to box up the split stack for external API (if we decide to do that).
* Simulate entering/leaving trace by boxing and unboxing everything on the stack (with and w/out a unified stack layout). This affects both single and split stack interpreters.
Comment 12 • 15 years ago
See bug 161110 for the experiment from over a year ago using nan-boxing on 32-bit target platforms (I too found "32-bit NaN box" a confusing term!). That experiment had a negative result compared to the status quo, but it was not a minimal experiment like dvander's code -- so lots of room for confounding variables.
/be
Assignee
Comment 13 • 15 years ago
Sorry, I'm finding it hard to come up with good names for all these things ;)
I think if we pass the really wide boxes by-value everywhere, and don't specialize the NaN-boxing mechanism based on the platform, it will be a loss on x86.
Assignee
Comment 14 • 15 years ago
I added a pseudo-function call, a native that adds its arguments and returns the result. In this benchmark, it sits in between an inner and outer loop (the outer loop runs 5,000 times, the inner one runs 10,500 times). The addition eventually overflows an integer.
I've also taken Luke's suggestion and added a new 128-bit box type.
Baseline performance with a single-stack interpreter:
      |  BOX |  NAN |  32N |  32T |  64T | 128T
------+------+------+------+------+------+------
  x64 | 4.93 | 4.21 |  N/A |  N/A | 3.69 | 3.69
  x86 | 4.16 | 4.81 | 3.14 | 4.29 | 5.33 | 4.31
Dual-stack interpreter w/ 32-bit type tags:
      |  BOX |  NAN |  32N |  32T |  64T | 128T
------+------+------+------+------+------+------
  x64 | 2.62 | 2.63 |  N/A |  N/A | 2.70 | 2.68
  x86 | 3.14 | 3.59 | 3.43 | 3.13 | 3.19 | 3.20
Dual-stack interpreter w/ 8-bit type tags:
      |  BOX |  NAN |  32N |  32T |  64T | 128T
------+------+------+------+------+------+------
  x64 | 2.64 | 2.62 |  N/A |  N/A | 2.83 | 2.85
  x86 | 3.44 | 4.07 | 3.14 | 3.46 | 6.58 | 3.34
Looks like this one didn't like the 8-bit tag stack.
Comment 15 • 15 years ago
Fortunately, unlike the bitwidth of packed values and payloads, we should be able to abstract 32-/8-bit tag types fairly easily and postpone the decision.
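For instance, a sketch of such an abstraction (names invented):

#include <stdint.h>

// Interpreter code only ever sees TypeTag, so the tag width can be
// switched at compile time and revisited later.
#ifdef NARROW_TAGS
typedef uint8_t  TypeTag;
#else
typedef uint32_t TypeTag;
#endif

enum { SLOTS = 256 };      // arbitrary, for the sketch
TypeTag tagStack[SLOTS];   // runs parallel to the value stack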
Assignee
Comment 16 • 15 years ago
Yup, given that and the results so far I'm going to stop benchmarking the 8-bit tags.
This version calls the native in the innermost loop, and numbers passed+returned stay as integers.
Baseline performance with a single-stack interpreter:
      |  BOX |  NAN |  32N |  32T |  64T | 128T
------+------+------+------+------+------+------
  x64 | 6.89 | 5.98 |  N/A |  N/A | 6.82 | 6.77
  x86 | 6.93 | 7.80 | 5.13 | 7.25 | 6.42 | 7.53
Dual-stack interpreter w/ 32-bit type tags:
      |  BOX |  NAN |  32N |  32T |  64T | 128T
------+------+------+------+------+------+------
  x64 | 6.44 | 4.90 |  N/A |  N/A | 5.89 | 5.89
  x86 | 6.79 | 7.91 | 8.00 | 7.88 | 8.01 | 6.51
Lastly (starting to beat this to death), here's a similar benchmark that has to keep boxing doubles instead of integers (it also has more stack transitions). Now the GC can come into play.
Baseline performance with a single-stack interpreter:
      |  BOX |  NAN |  32N |  32T |  64T | 128T
------+------+------+------+------+------+------
  x64 | 9.64 | 9.33 |  N/A |  N/A | 9.78 | 9.78
  x86 | 11.9 | 12.3 | 8.07 | 12.9 | 10.6 | 11.3
Dual-stack interpreter w/ 32-bit type tags:
      |  BOX |  NAN |  32N |  32T |  64T | 128T
------+------+------+------+------+------+------
  x64 | 9.83 | 6.51 |  N/A |  N/A | 7.82 | 7.69
  x86 | 12.3 | 10.8 | 8.64 | 12.5 | 10.0 | 8.91
Assignee
Comment 17 • 15 years ago
Last set of tests. In this set I've excluded the "64T" and "32T" benchmarks since they don't seem to be going anywhere.
This test is worst-case tracing behavior. An outer loop runs 5,000 times and an inner loop runs 10,500 times. The inner loop simulates a trace entry/exit every iteration. For simplicity, it treats I/D as separate types, unlike our tracer.
The five columns (now with better naming) are:
* SMBOX - SpiderMonkey boxes.
* NAN64 - 64-bit NaN boxes (now 64-bit only, doesn't seem good on x86).
* NAN32 - 32-bit NaN boxes (32-bit only).
* LARGE - 32-bit tag, 32-bit alignment, 64-bit payload.
* DUAL - Dual stack model, boxing model is irrelevant for these tests.
On x64 I've chosen NAN64, and for x86 I've chosen NAN32.
For trace entry/exit, each model has different properties (see the memcpy() sketch after this list for the DUAL case):
- DUAL:  You can memcpy() the exit typemap onto the type stack.
         The tracer does not need to copy values to a separate stack.
- LARGE: The exit type map entries must be placed in each stack slot.
         The tracer does not need to copy values to a separate stack.
         Unboxing is a nop.
- NAN64: Values can be decoded/encoded in-place.
- NAN32: Values can be decoded/encoded in-place. Unboxing can be a nop.
- SMBOX: Values must be decoded and encoded in separate stacks. This is
         not strictly true on 64-bit machines, but a separate stack is
         used here for simplicity and consistency with the current JIT.
         Also, tagged doubles may contain a range of 32-bit integers,
         unlike the other boxing mechanisms, which makes type checking
         slightly more expensive.
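For instance, under the DUAL model the exit path reduces to roughly this (a sketch; TypeTag as in the sketch under comment 15, other names invented):

#include <string.h>

// Leaving trace with dual stacks: values already sit in interpreter
// slots, so restoring types is one bulk copy of the exit typemap onto
// the tag stack -- no per-value unboxing loop.
void LeaveTraceDual(TypeTag *tagStack, const TypeTag *exitTypeMap, size_t n) {
    memcpy(tagStack, exitTypeMap, n * sizeof(TypeTag));
}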
All of the following tests have 10 values on the stack.
...
Worst-case, finding a peer (no trace entry/exit). This is a baseline that assumes running a trace is free.
      | SMBOX | NAN64 | NAN32 | LARGE | DUAL
------+-------+-------+-------+-------+------
  x64 |  7.53 |  6.68 |   N/A |  8.57 | 4.8
  x86 | 11.6  |   N/A |  9.46 | 11.0  | 6.51
Worst-case, simulating entering/exiting traces:
      | SMBOX | NAN64 | NAN32 | LARGE | DUAL
------+-------+-------+-------+-------+------
  x64 | 10.6  |  8.36 |   N/A | 10.3  | 6.00
  x86 | 14.8  |   N/A |  8.74 | 11.2  | 7.93
Assignee
Comment 18 • 15 years ago
Since this is a micro-benchmark and it doesn't actually trace, it doesn't tell us other interesting things, like whether having stack values further apart would hurt the traced code.
I tried poking around with I/D coercion like TM does, and adding more types - the difference between SMBOX / DUAL grows noticeably as the logic becomes more complex.
I also tried moving the trace simulation to an outer loop, so it ran 5,000 times instead of 5,000*10,500. I only tried this for x86 SMBOX since it would show the most difference:
- Only Peers: 4.14s
- Entry/Exits: 4.28s
My conclusions from all this... stuff:
* Take these results with a grain of salt.
* We should take care to abstract how the baseline JIT generates code storing/loading values and checking types.
* It seems hard for us to lose by choosing a new layout later on.
* Heap allocated doubles can really kill us.
* Various NaN boxing methods seem to work best.
* Dual stack works really well with tracing (we expected this).
* When tracing isn't pathologically bad, the stack scheme matters much less.
However, better schemes are pure win, especially if a later goal is for
seamless TM/JM transitions, like TM despecializing a method call by
invoking it in JM while on-trace.
Reporter
Comment 19 • 15 years ago
More data analysis:
- On x64, the clear winner is NAN with dual stacks.
- On x86:
- with a single stack, 32N wins.
- with dual stacks, there is no clear winner. (Everything except NAN is
sometimes best.)
- in the basic interpreter tests, single stack is always equal to or
better than dual stacks, with a speedup factor of 1.00-1.27x.
- in the tracing tests, dual stacks are better, with a speedup factor
of 1.10-1.45x.
If we are primarily targeting x64, then we should just plan for NAN with dual stacks.
If we are primarily targeting x86, then things are unclear. I suspect the slowdown we see here with dual stacks will be mitigated: they increase register pressure in the interpreter, but probably won't increase register pressure much in JM. So dual stacks do seem pretty attractive to me for x86 as well. The value representation is the thing we can't plan for on x86; we will just have to choose it empirically later.
Putting those two together, it seems we would want to go with dual stacks and design so the value representation can be changed easily.
The X factor in that decision is that dual stacks don't seem to help x86 that much, so *if* our main goal were to release something for x86 sooner, then we wouldn't want to do dual stacks. Conversely, if dual stacks don't take *too* long (say, a month or less), then they don't seem like much of a schedule risk.
Another issue: if with JM we intend to run only in JM/TM when perf matters (so we use the base interpreter only for debugging and such), then we could consider making JM use dual stacks but leaving the interpreter alone and converting as necessary (changing the interpreter over to dual stacks later on if desired).
Comment 20 • 15 years ago
Great data, and both of the analyses make sense.
I wanted to make one additional point regarding the ambiguity about what to do on x86. For non-trace measurements, it seems like the differences between all the layouts will be dampened when other normal costs are put back in, e.g., property lookup. For example, although it's not quite apples-to-apples, I just compiled and compared WebKit's JSVALUE32 vs. JSVALUE32_64 on SunSpider, and JSVALUE32_64 was only about 1% faster (11 ms), despite our micro-benchmarks showing the JSVALUE32_64 scheme generally pwning.
OTOH, based on David's measurements in the early comments of where our time is spent in worst-case tracing scenarios like early-boyer, I think the dual stack speedup will remain significant in a real interpreter. While there are many cases (we hope most) where trace transition time is negligible, it seems to be the worst-cases that keep jumping up and biting us.
Thus, for x86, double stack + 32N seems like the way to go.
Assignee
Comment 21 • 15 years ago
Here are the benchmarks reproduced on ARM. All numbers are in seconds. They are slightly altered so I could run them without waiting a hundred years.
gcd variants:
* -ss is single stack.
* -ds is dual stack.
* -d is "lots of doubles"
* -i is "no doubles"
           | SMBOX | NAN64 | NAN32 | LARGE
-----------+-------+-------+-------+------
  gcd-ss-d |  63.7 |  62.6 |  61.0 |  61.8
  gcd-ds-d |  49.6 |  52.9 |  50.2 |  49.9

           | SMBOX | NAN64 | NAN32 | LARGE
-----------+-------+-------+-------+------
  gcd-ss-i |  5.37 |  6.36 |  5.74 |  6.00
  gcd-ds-i |  5.56 |  5.93 |  5.51 |  5.76
Here are variations to test worst-case tracing:
* peer means "finding peers" only
* exec means "finding peers + boxing/unboxing stack frame"
* -s and -d mean single versus dual stack.
         | SMBOX | NAN64 | NAN32 | LARGE
---------+-------+-------+-------+------
  peer-s | 124.1 | 101.1 |  93.9 |  92.7
  peer-d |  67.7 |  74.0 |  68.6 |  67.5
Reporter
Comment 22 • 15 years ago
Once again, dual stack wins. We discussed this a bit yesterday but I should document it here. Given all the results we have above, and taking into account that x64 and ARM are the platforms of the future, it seems that dual-stack is the winner, with NaN64 on x64, NaN32 on ARM, and NaN32/32T/64T/128T on x86.
Comment 23 • 13 years ago
Is there more to be done here?
Reporter
Updated • 13 years ago
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX