Closed Bug 549143 (fatvals) Opened 14 years ago Closed 14 years ago

fat unboxed values

Categories: (Core :: JavaScript Engine, defect)
Status: RESOLVED FIXED
People: (Reporter: luke, Assigned: luke)
Whiteboard: fixed-in-tracemonkey
Attachments: (4 files, 5 obsolete files)

The idea is to split the JS engine's stack of boxed jsvals into two stacks, one containing the unboxed (64-bit) value and the other containing the type tag.  These stacks would be maintained in parallel.  The main benefits are:
 - less boxing/unboxing computation.  In particular, locals, parameters (and, with bug 548844, globals) can stay permanently unboxed.  Also, using 64-bit unboxed slots allows doubles to be stored in-place.  This is measured in David's wonderful micro-interpreter (bug 539123).
 - allows (with additional work) very fast transition to and from trace.  Ultimately, type compatibility can be checked by a memcmp with a segment of the type stack, the tracer can use the VM's value stack instead of TraceNativeStorage (no copying in/out), and LeaveTree can simply memcpy to the type stack. See bug 539123 comment 17.
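
To make the second point concrete, here is a minimal sketch, assuming a densely packed one-byte-per-slot type stack (all names are illustrative, not the engine's API):

#include <cstring>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t TypeTag;

/* With a separate, densely packed type stack, checking whether the live
   stack values match a trace's entry typemap reduces to one flat comparison,
   with no per-value unboxing or unpredictable branches. */
bool typesMatchTrace(const TypeTag *typeStack, const TypeTag *entryTypeMap,
                     size_t nslots)
{
    return std::memcmp(typeStack, entryTypeMap, nslots * sizeof(TypeTag)) == 0;
}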

This may seem to add more stack maintenance overhead, but (1) the contiguous stack layout (bug 540706) makes pushing/popping cheaper, and (2) JM allows us to drop many stack increment/decrement operations.

This will be a fairly large change and I need to see how it can be subdivided before estimating time.  Also have to finish a few bugs first.
Summary: dual-stack layout → change JS engine to use separate unboxed value and type stacks
Depends on: 539144
Here's the rough plan.  I think the task can be broken down into the following steps:
 1. #ifdef off big hunks of the runtime -- everything but what's needed to run the main interpreter loop -- and get the dual stack working, keeping the current jsval representation for dslots
 2. get this working with the method jit
 3. get this working with the tracing jit
 4. take out the #ifdefs from step 1 and get all automated tests running

Before step 4, we can only micro-benchmark, but each step, with the possible exception of 1, should show improvements.  Bug 539144 will benefit from and magnify the performance effects of this work.  A little grossness will need to be added to the JSStackFrame (for argv and rval), but bug 539144 and bug 550639, respectively, can make this go away.

Because we are all now officially into time estimates, here is my best guess:
 1. one month, as most of the major questions will come up and need to be figured out
 2. one week, with 3 people
 3. one week, with 3 people
 4. one week to write, with 3 people
    half a week to wade through inevitable automated test failures

This initial plan does not include, and will be followed by:
 - changing the jsval representation to avoid doubles on the heap
 - changing the set of jsval types to match JSTraceType
which really capitalize on all this work.
Status: NEW → ASSIGNED
Depends on: 552586
Slight change of plans.  As this bug will involve touching all uses of jsvals on the stack, it would not be much extra work to go ahead and upgrade to using JSTraceType as the tag of all values.  This has the immediate benefit that JSVAL_IS_NULL and (with bug 552586) JSVAL_IS_VOID, which seem to be fairly hot, can still be implemented with a single comparison (e.g., (*stp == Tag::Null) instead of (*stp == Tag::Object && sp->asObject == NULL)).
Attached patch WIP 1 (obsolete) — Splinter Review
WIP patch, mostly updating jscntxt.cpp, a lot more to do.  For those interested, jsinterp.h has the changes to key VM data structures.
Comment on attachment 433603 [details] [diff] [review]
WIP 1

>+struct DualStackPtr
>+{
>+    UnboxedValue    *p;
>+    Tag::Type       *tp;

There's an invariant relation between 'p' and 'tp', right?  Ie. given one the other can only have a single value?  Something like:

  (p - baseOfStack) / sizeof(stackEntry) == (tp - baseOfTagStack) / sizeof(tag)

From my quick skim of the patch this invariant wasn't clear.  (I may have missed something).  I wonder if it's possible to make this invariant clearer, maybe via assertions.  Alternatively, I wonder if it's possible to just use 'p' and always derive 'tp' from it when needed?  There's a space/time trade-off there, I'm not sure which would be better. 

In summary, I think it's worth thinking about this connection between the two values and whether this is the best representation.
You are correct about the invariant.  I would also like to, as you suggest, check/enforce this invariant with some js::Regs members once I get into the meat of jsinterp.  Since loading from a constant offset from regs.s.p and regs.s.tp is very common in the interpreter loop and method-jit stub calls, I think the perf overhead of recomputing s.tp would be high.  I think equivalent cleanliness can be achieved, though, by adding efficient abstract stack operations and working in terms of those.
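
To make this concrete, here is a sketch of how the invariant could be asserted through abstract stack operations (the types and member names are guesses based on the patch excerpt, not the final code):

#include <cassert>
#include <stdint.h>

struct UnboxedValue { uint64_t bits; };    /* 64-bit payload */
namespace Tag { enum Type { Null, Object /* ... */ }; }

struct DualStackPtr {
    UnboxedValue *p;     /* cursor into the payload stack */
    Tag::Type    *tp;    /* parallel cursor into the tag stack */

    /* Invariant: both cursors sit at the same logical depth. */
    void assertSynced(const UnboxedValue *pBase, const Tag::Type *tpBase) const {
        assert(p - pBase == tp - tpBase);
    }

    void push(const UnboxedValue &v, Tag::Type t) { *p++ = v; *tp++ = t; }
    void pop() { --p; --tp; }
};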
Depends on: 554667
Attached patch WIP (RIP?) (obsolete) — Splinter Review
After transforming jscntxt.cpp and 30% of jsops.cpp, I started to get the feeling that maintaining and passing around this second stack pointer is going to cost (more registers needed, more pointer arithmetic, more parameter passing).  All this was known before starting and the reason to go ahead anyway is that keeping values unboxed has all these good properties mentioned above that we want.  However, getting into the thick of it, I'm seeing how very many places we will be incurring this overhead.

There is a different way to achieve the good properties (viz., no heap doubles and fast on/off trace transition) without two stacks: use a single stack and interleave types and (unboxed) values.  This was briefly considered before and I dismissed the idea with the following argument: "because that will waste too much space: to avoid expensive alignment penalties, the type tag needs to be padded out to 64 bits -- thats 128-bits per value!".  Essentially:

struct FatValue {
  union { ... } payload;  /* 64-bit */
  Type type;              /* 32-bit */
  uint32_t padding;       /* 32-bit */
};

Some recent measurements I performed changed my mind.  I measured how much stack we actually use for slots on a bunch of JS in the browser: SS, V8, FF startup, Google Apps, Facebook, MySpace, BubbleMark, and the Fluid Simulator.  Basically, for each opcode, I recorded how much stuff was on the stack and how much of that was slots (vs. JSStackFrame and js::CallStack).  (This is all with the contiguous stack patch.)  I found that:

 1. On average we don't have many values on the stack: SS, FB, FF, Bubble, and Fluid mostly stay under 100; V8 and Google Apps mostly under 400.

 2. Slots account for a smaller percentage of the bytes on the stack than JSStackFrames: usually <50%, often <20%.

 3. We have good temporal and spatial locality for stack data: the average number of ops before the stack grows or shrinks by 8KB is >6e5; for 16KB, >5e7 (measured by recording the stack position and counting how many ops until abs(current - recorded) > X; see the sketch below).  FF startup doesn't even hit the 8KB limit once.
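
As a sketch, the locality measurement in point 3 amounts to something like this (all names invented):

#include <stdint.h>

struct LocalityCounter {
    intptr_t recorded;        /* stack byte position at last reset */
    uint64_t opsSinceReset;

    void reset(intptr_t current) { recorded = current; opsSinceReset = 0; }

    /* Called once per executed opcode; returns the op count survived when an
       excursion beyond 'limit' bytes (e.g. 8KB or 16KB) occurs, else 0. */
    uint64_t tick(intptr_t current, intptr_t limit) {
        ++opsSinceReset;
        intptr_t d = current - recorded;
        if (d < 0) d = -d;
        if (d <= limit)
            return 0;
        uint64_t n = opsSinceReset;
        reset(current);
        return n;
    }
};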

Thus, fattening stack slots up to 16 bytes seems like it wouldn't be the worst thing.  When copying values, of course, we wouldn't copy the padding, so the memory traffic would be the same as the dual stack.  Also, with this strategy, the JSAPIs don't have to change; simply fatten up jsval and define the JSVAL_* ops appropriately.  (For internal use, we'd still need a new Value type with more efficient methods, but it would be layout-compatible and cast at API boundaries.)  This greatly simplifies the calling of JSClass and JSObjectOp hooks from inside the engine and the amount of work needed to get FF doing the efficient thing.

A first step would be to have only one value type used everywhere, which means object dslots would also get much fatter; unlike the stack, this probably would have bad memory effects.  A logical next step, then, would be to NaN-box dslots and box/unbox on property gets/sets.  A more ambitious change, already being discussed by Brendan et al, is to keep type tags in the property tree (thus storing only the 64-bit payload in dslots), having objects change JSScope when one of their properties changes type, and despecializing to storing boxed values in dslots if an object's properties change type too often.  Either way, we can totally remove heap-allocated doubles, and the change is local enough that the issue could be ignored for now and implemented without a second humungous patch.

So, barring convincing counter-arguments, which I'm happy to hear, I'm going to abandon the dual stack patch and start working on fat values.  Little bummed about losing a week and a half, but I guess it's better than losing several months; should progress faster this time.
Attachment #433603 - Attachment is obsolete: true
(In reply to comment #6)

Agreed that there is no reason to worry about the extra space on the stack. Actually, the reason I thought we wanted the dual stack was to memcpy() typemaps back. We'll lose that ability. Synthetic benchmarks suggested that re-boxing values from the dual stack - frequently - was not a big deal, and, knowing that, I thought we wouldn't try to modify every internal API call even with the dual stack in place.

Anyway, if it's simpler and faster (in terms of man-power) to just change the jsval representation, sounds great to me. We'll still get all sorts of wins and we won't need as many changes to either the interpreter or method JIT.
(In reply to comment #7)
> Actually, the reason I thought we wanted the dual stack was to memcpy()
> typemaps back. We'll lose that ability.

memcmp and memcpy are out, true, but I think the critical performance property is preserved, viz., that matching peers and LeaveTree can use simple, unrollable loops which don't contain unpredictable branches.

> Synthetic benchmarks suggested that
> re-boxing values from the dual stack - frequently - was not a big deal

That was not the primary reason for the change.  On the subject, though, does bug 539123 comment 16 test always boxing the same type (always integers / always doubles) or a randomized (hence, unpredictable) mix of 3 or more types?

> Anyway, if it's simpler and faster (in terms of man-power) to just change the
> jsval representation, sounds great to me.  We'll still get all sorts of wins
> and we won't need as many changes to either the interpreter or method JIT.

Assuming no negative memory effects, less pointer arithmetic, live registers, and parameter passing should lead to faster code too.  It may not end up mattering, but bug 552574 seems to indicate, for hot paths, that these things do matter.  But I agree that what really matters is avoiding heap doubles and allowing fast trace transitions; dual-stack vs. interleaved single-stack is mostly a matter of engineering.
Depends on: 557404
No longer depends on: 557404
Attached patch WIP 1 (obsolete) — Splinter Review
Work in progress.  I decided to convert jsobj.cpp because it works heavily with ids, JSObject*s, and class hooks, all of which are potential sticky points, but it all seemed to go ok.

One annoyance is that JSObject* can be one of three (new) jsval types: null, function object, or non-function object.  To avoid having to do obj->isFunction() every time one of these turns into a jsval (something that was free before), I changed some functions to pass pointers to Values containing objects (thereby not losing the type information in the common case where the JSObject* comes from a Value).  Because this happens a fair amount, I made a new ObjPtr value type which is a restriction of Value to the three types a JSObject* can be.  ObjPtr injects into Value and Value conditionally restricts into ObjPtr.

Other than that, the work is fairly straightforward (compared to the dual stack); it just touches practically every function.
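
For illustration, a sketch of the ObjPtr idea (the layout and names here are my own, not the patch's exact code):

#include <stdint.h>

struct JSObject;
namespace Tag { enum Type { Null, FunObj, NonFunObj, Int32 /* ... */ }; }

struct Value {
    Tag::Type tag;
    union { JSObject *obj; int32_t i32; } data;
};

struct ObjPtr {
    Tag::Type tag;      /* only Null, FunObj, or NonFunObj */
    JSObject *obj;

    /* Injection into Value: the already-known tag is carried along, so no
       obj->isFunction() test is repeated. */
    Value toValue() const { Value v; v.tag = tag; v.data.obj = obj; return v; }
};

/* Conditional restriction of Value into ObjPtr. */
bool toObjPtr(const Value &v, ObjPtr *out) {
    if (v.tag != Tag::Null && v.tag != Tag::FunObj && v.tag != Tag::NonFunObj)
        return false;
    out->tag = v.tag;
    out->obj = v.data.obj;
    return true;
}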
Attachment #435776 - Attachment is obsolete: true
Summary: change JS engine to use separate unboxed value and type stacks → fat unboxed values
Depends on: 561706
Attached patch WIP 2 (obsolete) — Splinter Review
This patch contains the conversion of jsinterp.cpp and jsops.cpp.  Altogether, more than 14KLOC examined, ~6KLOC touched.  Compared to the dual-stack conversion, this is much easier.  Some of the change is mechanical, but a good bit of it is not, mainly because (1) the set of value types was changed to match JSTraceType and (2) jsvals are often copied when no copy is fundamentally necessary so local refactoring is required to avoid unnecessary overhead for fat values.

To make copies visible, I hid Value::operator= and made the copy ctor explicit.  When the conversion is finished, though, these can be made public/implicit again so the code can look pretty.
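
The trick looks roughly like this (a pre-C++11 sketch, not the patch's exact code):

#include <stdint.h>

class Value {
  public:
    Value() : tag(0), payload(0) {}
    /* Copying requires writing Value(x) explicitly at the call site. */
    explicit Value(const Value &other)
        : tag(other.tag), payload(other.payload) {}
  private:
    Value &operator=(const Value &);  /* hidden: accidental copies won't compile */
    uint32_t tag;
    uint64_t payload;
};

/* Value a; Value b = a;   -- error: the copy ctor is explicit
   Value c(a);             -- OK: the copy is explicit and visible */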

I also realized that a good way forward is to not make a mini-interpreter by #ifdef'ing off big chunks, but rather to get the whole interpreter building, with most of the engine using the old jsval type and JSVAL_* macros.  This would allow early testing/fuzzing and, once finished, parallel development can really get going.  So I'm going to try to get that going next.
Attachment #439401 - Attachment is obsolete: true
Depends on: 562991
Attached patch WIP 3 (obsolete) — Splinter Review
This patch contains conversion for: jsapi, jsarray, jsatom, jsbool, jscntxt, jsdate, jsdbgapi, jsemit, jsfun, jsgc, jsinterp, jsobj, jsops.  Things are going a lot faster now.  To get a !JS_TRACER build, the remaining files that seem like they'll take time are jsiter (perhaps less so than before thanks to Andreas) and jsstr.  I need to do a monster rebase though, and this seems like a good time to switch to a user repo.

1.2MB patch, 7.8K lines +, 6K lines - (much of the positive growth is due to new declarations/comments in jsapi.h and jspubtd.h).
Attachment #441659 - Attachment is obsolete: true
Depends on: 565157
Attached patch WIP 4Splinter Review
Only jsxml to go (and probably a few loose ends), and then it should be able to build (without JS_TRACER defined).  I ended up converting most of the files to use the new API (unlike my proposed plan in comment 10), since it was often easier.  I think only json, jsopcode, and (soon) jsxml are left unconverted.  The patch is 1.7M, but now in a user repo:

 http://hg.mozilla.org/users/lwagner_mozilla.com/fatval

Soon begins the time of great hurting, i.e., getting all automated tests to pass.  After things aren't totally broken, parallel work on tracing/jaegering can begin.
Attachment #444487 - Attachment is obsolete: true
As of http://hg.mozilla.org/users/lwagner_mozilla.com/fatval/rev/547f50803955, everything builds with --disable-tracejit.  I took out the explicit copying, since most everything has been converted.  I also removed ObjPtr because it didn't turn out to be very useful.  Time to find all those bugs I've spent a month carefully hiding.
Tagless, bitches!

Awesome.

/dvander
Alright, fixed enough bugs to get some loops running in a 32-bit opt build:

var d = .11;
for (var i = 0; i < 10000000; ++i) {
    d += .23;
    d -= .4;
    d *= .045;
}

trunk: 2008ms    patch: 1611ms

var d = .11;
for (var i = 0; i < 10000000; ++i) {
    d = Math.sin(d);
}

trunk: 1891ms   patch: 1480ms

var d = 1;
for (var i = 0; i < 10000000; ++i) {
    d += 2;
    if (d > 1000000)
        d -= 2000000;
}

trunk: 997ms    patch: 869ms

OTOH, the empty loop and loops using only increment (where we do tricks to avoid boxing/unboxing) go slower, since they benefit little from the reduced boxing/unboxing while fat values require more copying.  So, once more bugs are squashed, I'll have to look more carefully, but it's a hopeful result so far.
Luke, I assume those are -j numbers?
(In reply to comment #16)
> Luke, I assume those are -j numbers?

No, I only have things building/running with --disable-tracejit.  Tracing will need to proceed in two phases: getting it working at all, then getting it to work in-place (in the contiguous stack).
Ah, false alarm on the empty loop/slow-on-ints comment above.  First, I noticed that this loop got faster:

for (var i = 0; i < 10000000; i += 1) {}

trunk: 558ms   patch: 492ms

So that seemed to finger JSOP_GVARINC as the culprit.  Looking at the impl, I noticed I was doing some copying on account of the getSlotMT() call, which must return a copy in JS_THREADSAFE builds (though not in the js shell, which I am using to measure).  Now, it seems the need for getSlotMT/setSlotMT is about to go away in the compartment world (JM already assumes this), so I switched to using getSlotRef() and working in-place to avoid copies, and regained parity on the empty loop (using ++).  In fact, the following loop switched from a slowdown to a speedup:

var d = 0;
for (var i = 0; i < 10000; i++) {
    for (var j = 0; j < 1000; j++)
        d++;
}

trunk: 650ms    patch: 433ms

This is all highly synthetic, but it calms some of my worries that fat-values are only good for double-heavy code.  It also highlights the importance of avoiding unnecessary fat-value copying in hot code.

Still need to squash some bugs before I can run real benchmarks.
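
Hypothetical signatures illustrating the copy-avoidance above: a copying getter moves a fat 16-byte value on every read, while a reference getter lets hot ops work on the slot in place.

#include <stddef.h>
#include <stdint.h>

struct Value { uint32_t tag; uint64_t payload; };

struct FakeObj {
    Value slots[8];

    Value  getSlot(size_t i) const { return slots[i]; }   /* fat copy out */
    Value &getSlotRef(size_t i)    { return slots[i]; }   /* no copy */
};

/* e.g. a JSOP_GVARINC-style increment touches only the payload word:
       Value &v = obj.getSlotRef(slot);
       ++v.payload;    // assuming the tag already says int32 */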
Trace tests pass, so time for SunSpider (this is interp-only):

** TOTAL **:           1.058x as fast    1926.1ms   1820.3ms     significant

=============================================================================

  3d:                  1.176x as fast     289.1ms    245.9ms     significant
    cube:              1.108x as fast     100.6ms     90.8ms     significant
    morph:             1.39x as fast       97.3ms     70.2ms     significant
    raytrace:          1.074x as fast      91.2ms     84.9ms     significant

  access:              1.036x as fast     352.8ms    340.4ms     significant
    binary-trees:      *1.103x as slow*    35.1ms     38.7ms     significant
    fannkuch:          *1.131x as slow*   146.5ms    165.7ms     significant
    nbody:             1.32x as fast      122.3ms     93.0ms     significant
    nsieve:            1.137x as fast      48.9ms     43.0ms     significant

  bitops:              1.050x as fast     256.7ms    244.5ms     significant
    3bit-bits-in-byte: 1.22x as fast       57.8ms     47.2ms     significant
    bits-in-byte:      *1.073x as slow*    61.3ms     65.8ms     significant
    bitwise-and:       1.028x as fast      59.3ms     57.7ms     significant
    nsieve-bits:       1.061x as fast      78.3ms     73.8ms     significant

  controlflow:         1.074x as fast      36.1ms     33.6ms     significant
    recursive:         1.074x as fast      36.1ms     33.6ms     significant

  crypto:              1.129x as fast     144.9ms    128.3ms     significant
    aes:               *1.057x as slow*    54.4ms     57.5ms     significant
    md5:               1.29x as fast       44.2ms     34.3ms     significant
    sha1:              1.27x as fast       46.3ms     36.5ms     significant

  date:                *1.159x as slow*   116.7ms    135.2ms     significant
    format-tofte:      *1.187x as slow*    51.8ms     61.5ms     significant
    format-xparb:      *1.136x as slow*    64.9ms     73.7ms     significant

  math:                1.148x as fast     264.3ms    230.2ms     significant
    cordic:            1.032x as fast     100.9ms     97.8ms     significant
    partial-sums:      1.153x as fast     104.1ms     90.3ms     significant
    spectral-norm:     1.41x as fast       59.3ms     42.1ms     significant

  regexp:              ??                 153.3ms    153.7ms
    dna:               ??                 153.3ms    153.7ms

  string:              1.012x as fast     312.2ms    308.5ms     significant
    base64:            *1.040x as slow*    32.1ms     33.4ms     significant
    fasta:             *1.020x as slow*    64.9ms     66.2ms     significant
    tagcloud:          *1.033x as slow*    75.8ms     78.3ms     significant
    unpack-code:       1.033x as fast      93.3ms     90.3ms     significant
    validate-input:    1.144x as fast      46.1ms     40.3ms     significant

So there are a few tests that need looking into but this seems like a generally positive result, at least enough to push on to see what happens with tracing.  Need to fix a bug or two before I can run v8...
The v8 bug was that v8-splay currently allocates up to the 64MB shell runtime heap limit, and, with fat values, this limit needs to be bumped because dslots are 4x bigger.  As comment 6 points out, the goal is *not* to use fat values in dslots, because, as opposed to the stack, bloating dslots could really hurt.

That's a good segue to the v8 results.  On 32-bit there is a 4% slowdown, and on 64-bit a 6% slowdown.  A simple measurement of how many bytes are being cx->malloc'ed shows a 1.5x-2.5x increase.  E.g., v8-deltablue goes from 14MB to 31MB.  So that means it's time to investigate what to do about dslots.

For fun, I tried taking out the padding (so that values are 96-bit, but every-other value is misaligned for doubles).  This gives worse results for v8 (7% total slowdown) but better results for SS (6.2% total speedup).

64-bit values with NaN-boxing is promising.  IIUC, for 32-bit, the encoding JSC uses can be particularly inexpensive when boxing/unboxing non-doubles, so I'm going to investigate that next.  (Sticking types in the property tree and storing 64-bit payloads in dslots is also very interesting, and could benefit tracing performance independently, but it is a much bigger change and has non-trivial potential weaknesses to address, so I'm going to hold off on that.)
Heretically simple idea: what if we just bumped the 64MB limit?
(In reply to comment #21)
Heh, that was necessary for v8-splay to complete at all.
This builds and passes trace-tests on Win x86 as of 15ed2b53e5d5.
I tried enabling 8-byte alignment on Windows for Value using __declspec(align(8)), and it just made things slower. So I took out everything to do with trying to align values on Windows.
I changed the implementation of js::Value (and the few pieces of code that peek at its privates) to use nan-boxing (only working on 32-bit atm).  This made SS go from 5.8% faster to 9% faster.  V8 changed from 4% slower to ~1% slower.  So halving the js::Value size really helps, but not quite enough.  However, as described in comment 0, v8 should get an additional boost (compared to trunk) when tracing works, so it seems like this slowdown will turn into a speedup.

Still left to try is nan-boxing on 64-bit.
As of 12b2567a413d, this passes trace-test and jsreftests on Win x86.
This makes XPConnect build, at least.
I got 64-bit nan-boxing working and, testing on Linux x64, SS is 6% faster and v8 is altogether a tiny bit faster (40ms), although a bit slower on some individual tests.  Strangely, on microbenchmarks, loops full of integer operations get up to 40% faster whereas loops full of double operations only get 10% faster and even sometimes a bit slower.  Perhaps I'm doing something silly or perhaps going back and forth between treating values as integers and doubles is expensive.  This warrants further investigation.

The 64-bit nan-boxing scheme used is the same as the 32-bit one, which keeps payloads and type bits unboxed (henceforth referred to as nunboxing).  For this to work, though, all pointers stored in values must be in the low 4GB.  This is easy on Linux, but requires some trickery (as demonstrated by LuaJIT) for OS X and Windows.  So that needs to be done before moving on to getting tracing to work.
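
A hedged sketch of the nunboxing layout just described (the tag constants are illustrative, not the engine's actual encoding): non-double values live in the NaN space of a 64-bit double, with a 32-bit tag in the high word and the unboxed 32-bit payload in the low word; doubles are stored as themselves, which is exactly why pointers must fit in 32 bits.

#include <cstring>
#include <stdint.h>

typedef uint64_t jsval_layout;

/* Any high word above the largest double bit pattern works as a tag; these
   particular values are made up. */
static const uint32_t TAG_INT32  = 0xFFFF0001u;
static const uint32_t TAG_OBJECT = 0xFFFF0002u;

jsval_layout boxInt32(int32_t i) {
    return (uint64_t(TAG_INT32) << 32) | uint32_t(i);
}

jsval_layout boxObject(void *p) {
    /* Requires the pointer to be below 4GB -- the low-address constraint. */
    return (uint64_t(TAG_OBJECT) << 32) | uint32_t(uintptr_t(p));
}

jsval_layout boxDouble(double d) {
    jsval_layout v;
    std::memcpy(&v, &d, sizeof d);    /* doubles are stored unboxed */
    return v;
}

bool    isInt32(jsval_layout v) { return uint32_t(v >> 32) == TAG_INT32; }
int32_t toInt32(jsval_layout v) { return int32_t(uint32_t(v)); }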
Now I'm getting that v8 is 1.4% faster (significant, runs=40) with nunboxing on Linux x64 and none of the microbenchmarks are slower (although the speedup is still only about 10%, which seems low).  Weird.  Moving on.
Sounds great. Does this mean we can basically plan to go forward with nunboxing on all platforms?
Nunboxing definitely seems good for now.  There is a cheap conversion between nunboxes and 128-bit fat vals, so sometime later it might be interesting to try out fatvals-on-stack, nunboxes-on-heap.  This might be a win since, although nunboxes are basically unboxed, a few ops are a bit cheaper with fatvals due to the complete separation of type and payload.
I got nunboxing working (and checked in) on Win x64. This includes moving the atom tables, so it should work on Linux as well.

But it may be a no-go on Mac. I tried the LuaJIT 2 thing, and it fails on 10.6. Given a starting address of 0x10000, mmap returns 0x101000000 every time. We can ask around if there is a way to mmap 32-bit addresses on x64 OSX, but we may just have to use 128-bit vals there.
As previously noted, this doesn't work. But feel free to look at it.
Sorry if this should be obvious from the preceding -- we don't want to use 48-bit pointers and rely on canonical addressing, because it means we have to mask out the types rather than just do a 32-bit load for either of type and payload, right?

I guess if that was OK then the address we get from mmap wouldn't be a problem, hmm.
(In reply to comment #32)
> But it may be a no-go on Mac. I tried the LuaJIT 2 thing, and it fails on 10.6.
> Given a starting address of 0x10000, mmap returns 0x101000000 every time. We
> can ask around if there is a way to mmap 32-bit addresses on x64 OSX, but we
> may just have to use 128-bit vals there.

I think the lowest 4G on 64-bit darwin isn't usable; apparently a
deliberate scheme to cause non 64-bit clean programs to fail in an
obvious way.  See

http://developer.apple.com/mac/library/documentation/Darwin/Conceptual/64bitPorting/building/building.html

This 4G limit is also wired into valgrind for 64-bit darwin, as it has
to play some address space games and thus be aware of this issue.
keima:src dvander$ file luajit
luajit: Mach-O 64-bit executable x86_64

1082	  tbase = (char *)(CALL_MMAP(tsize));
(gdb) n
1083	  if (tbase != CMFAIL) {
(gdb) p tbase
$2 = 0x200000 ""

LuaJIT is doing something right here, but I don't know what. Either way it seems like we should have a backup plan.
For reference, these appear to be the linker flags LuaJIT 2 uses on x64 OS X; shrinking __PAGEZERO to 64KB and pinning the image bases keeps the low 4GB of address space mappable:

    TARGET_XLDFLAGS+= -pagezero_size 10000 -image_base 100000000
    TARGET_XSHLDFLAGS+= -image_base 7fff04c4a000
New JM compiler for fatval branch is taking place here:

  http://hg.mozilla.org/users/danderson_mozilla.com/moo

There is now enough in place to run bitops-bitwise-and, which seems to have gone from 6ms to 3ms. Given that this test is dominated by NAME/SETNAME, nice win for nunboxing.
This is like the patch for XPConnect, but adds stuff for ctypes, jsd, and some DOM/general browser parts. I just put in multiple patches from my queue. The order is: browserfix, browserfix-js, (the xpconn patch), moz, jsd, jsd-fake (workaround that builds without actually working correctly), browser-2, dom.

I think the first patch, and possibly the second, are mostly copies of other browser patches I got from Luke.
I just pushed a fix that lets us build and pass trace-tests on OS 10.6 64-bit using the flags in comment 37.

This still seems pretty sketchy--all binaries built that use JS would have to make sure to have that flag. We should probably make a 64-bit Mac version using 128-bit values or 64-bit nan-boxing once the 32-bit version is working well.
+Function("switch(\"\"){case 1:case 8:}")

asserts 32-bit js debug shell on Windows 7.

With fatval repo e2b5d7222117:

$ ./js-dbg-32-fv-nt.exe
js> +Function("switch(\"\"){case 1:case 8:}")
Assertion failure: JSBOXEDWORD_IS_STRING(w), at c:\jsfunfuzz-dbg-32-fv-41918-e2b5d7222117\compilepath\jspubtd.h:702
Alias: newjsvals
Depends on: 569651
Depends on: 569766
Status update:

There is good work in parallel on getting this ready to land on TM; dmandelin is working on getting tracing working; dvander is working on method-jitting; I'm continuing dmandelin's work on getting the browser building and passing tests.  It seems like roughly 2 weeks to get dmandelin's and my work finished.  After that, it might be a good idea to have a flag day where everyone takes a .cpp or two and audits the diff for lame bugs that appeared in the translation.  There are also a few places (gc, iterator, parser, decompiler) where the changes are non-trivial and need a careful eye.  Hopefully that could all happen in a few days.

There is also the issue of the shaky tricks on OS X to get nunboxing to work and the fact that tier >1 platforms without MAP32_BIT or first-fit allocator algorithms would be SOL.  To fix this, we've talked about a modified 64-bit scheme that is close to what Shaver said in comment 34: 13 bits of nan, 4 bits of type, 47 bits of pointer (thereby making the much milder assumption (as JSC already does) that userspace is running in the canonical lower half of the address space).  This loses the attractive property of one-encoding-to-rule-them-all, but even if a single encoding were used, it is still necessary to specialize operations for 32 and 64 bit.  Also, upon further consideration of the individual operations, such a scheme is hardly worse than nunboxing.

Together, the estimate is 3 weeks.
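
For concreteness, a sketch of that modified 64-bit scheme (the 13/4/47 bit split is from the comment above; the masks and helpers are my own illustration):

#include <stdint.h>

static const int      TYPE_SHIFT   = 47;
static const uint64_t PAYLOAD_MASK = (uint64_t(1) << 47) - 1;
static const uint64_t NAN_BITS     = uint64_t(0x1FFF) << 51;  /* 13-bit NaN prefix */

uint64_t boxPointer(unsigned type, void *p) {
    /* Canonical userspace pointers have their top 17 bits zero, so the
       pointer needs no masking before packing. */
    return NAN_BITS | (uint64_t(type) << TYPE_SHIFT) | uint64_t(uintptr_t(p));
}

unsigned typeOf(uint64_t v)    { return unsigned((v >> TYPE_SHIFT) & 0xF); }
void    *pointerOf(uint64_t v) { return (void *)uintptr_t(v & PAYLOAD_MASK); }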
Firefox now builds with fatvals.  Now working on startup and getting tests to run.
xpcshell tests pass. mochitest crashes early somewhere in Ogg, and the crash reproduces on the last cset from TM.  I guess that means it's time to rebase...
Depends on: 572298
Update:
The browser builds with tracing and passes most tests (a few mochitest failures left to work out).  Things go slightly slower in the shell with tracing, but bug 571623 comment 2 shows fatvals have a big perf improvement in browser builds.  So this should be landable on TM soon.

Sharking shell builds showed slowdowns in functions which manipulated jsids.  This makes sense, since jsids have the same 64-bit representation as jsvals (so much code treated the two loosely).  To get back some of this perf, I made jsid word-sized, which gives a 1.2%/2.4% bump on SS/V8.  Still working on getting this change to work in the browser.
Depends on: 572842
Depends on: 573171
Depends on: 573531
Depends on: 573549
Depends on: 573574
Depends on: 573578
Depends on: 573604
As of http://hg.mozilla.org/users/lwagner_mozilla.com/fatval/rev/ed42260c48fb we pass mochitest-plain in the browser on Windows 7 (except one test that is probably not due to fatvals--see bug 562955).
In getting ready for the new 64-bit jsvals, I tweaked nunboxing on 32-bit to store a dense type index (instead of a bitfield) which allows tracer functions like determineSlotTypes, NativeToValue, ValueToNative, IsEntityTypeCompatible, unbox_value, GetPromotedType, and getCoercedType to take advantage of the close mapping between js::Value types and TraceType.  This gave roughly a 1-2% speedup on SS and V8.  This is good, but not as high as originally hoped; it seems that while trace transition time is significant on V8 (measured with tracevis), these functions are not a large part of it.  Having traces work directly on the VM stack (instead of copying back and forth between the VM stack and TraceNativeStorage) is also possible with fatvals and may cut away more trace transition time, but that is for a future bug.
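
The dense-index point, roughly (enumerators invented for illustration): if value tags are small consecutive integers mirroring the trace types, mapping one to the other is a table lookup instead of a chain of bit tests.

enum ValueType { VT_DOUBLE, VT_INT32, VT_STRING, VT_OBJECT, VT_SPECIAL /* ... */ };
enum TraceType { TT_DOUBLE, TT_INT32, TT_STRING, TT_OBJECT, TT_SPECIAL /* ... */ };

static const TraceType valueToTraceType[] = {
    TT_DOUBLE, TT_INT32, TT_STRING, TT_OBJECT, TT_SPECIAL
};

TraceType getCoercedTypeSketch(ValueType vt) { return valueToTraceType[vt]; }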
I am about to add a fatval hazard in a patch. The function is JSCrossCompartmentWrapper::filterIterator(). Depending on who lands first, we have to fix this. It's a real _real_ pity that the iterator id/val split isn't its own patch that can land early. I keep conflicting with you in that area. Just making a note here so it doesn't get lost.
Depends on: 574518
(In reply to comment #48)
> I am about to add a fatval hazard in a patch. The function is
> JSCrossCompartmentWrapper::filterIterator(). Depending on who lands first we
> have to fix this. Its a real _real_ pity that the iterator id/val split isn't
> its own patch that can land early. I keep conflicting you in that area. Just
> making a note here so it doesn't get lost.

fatvals is really important.  I don't know how important your patch is.  Based on my limited info I suggest you wait for fatvals to land and then rebase.
Depends on: 574745
Depends on: 574874
Depends on: 574881
Blocks: 574969
Depends on: 575345
Update: 64-bit works with tracing and passes tests.  In a shell on OS X 10.6, I'm getting 1% faster on SS and 4-6% faster on V8.  But, confoundingly enough, running these two in the browser on 10.6 shows only a minute speedup.  There are a couple suboptimal parts in the 64-bit tracer integration (which could be improved with the help of bug 574969 and bug 575527), and I haven't spent any time profiling, so perhaps there are some easy gains lurking.

Still remaining are some small things that need to be fixed before landing.  Also, I'd like to try removing the funobj/nonfunobj distinction in js::Value to see if it makes any measurable difference.  All else being equal, it would be simpler not to have it.

It has been suggested that, given the size of the patch and the difficulty in coordinating a traditional review, this could all be landed and reviewed after the fact.  Most of the changes are local and simple, and the non-trivial changes have been discussed with those involved.  We have a temporary tinderbox (Cedar) that we will ensure is green before pushing.  Ideally, .cpps would be handed out so that no one was overburdened.  If all goes well, landing could happen before the weekend.
I think we urgently have to profile why we get a 20% speedup on x86 in the browser. Something smells terribly wrong there. And this landing is totally going to mess up my patch queue. That said, I agree that we should land it and then go over it. This has grown to enormous proportions and it needs to go in asap to avoid any further conflicts. I am happy to help review.
Testing is far more important than review for a big, mostly mechanical change like this. Suggest you call out the places you want reviewed, perhaps even attach a custom-cut patch.

Unifying fun/non-funobj sounds good, even though tracing keeps the type split.

/be
We've talked a bit about it in the office, but for posterity: the fun/non-funobj split probably hurts JM in the end, because it means we have to sync/load type tags to/from memory in a few hot cases. The only places I can think of it really helping are JIT'd method read/write barriers.

It should be easy to test though, once the tag is gone.
(In reply to comment #50)
> Still remaining are some small things that need to be fixed before landing. 

Is the conservative GC scan already updated for the value-packing scheme?
(In reply to comment #53)
> We've talked a bit about it in the office, but for posterity: the
> fun/non-funobj split probably hurts JM in the end, because it means we have to
> sync/load type tags to/from memory in a few hot cases. The only places I can
> think of it really helping are JIT'd method read/write barriers.

And there, it matters only when *filling* the PIC or property cache. Hit rate has to be enough to hide this in the noise, or something's wrong. Note we use static analysis to despecialize from function values to slots (unbranding).

> It should be easy to test though, once the tag is gone.

Sure -- measure 2x, cut 1x ;-).

/be
(In reply to comment #54)
Need to rebase first, but it's on the todo list.
There was little to no loss from merging funobj and nonfunobj in js::Value, and great simplification for the programmer, so I merged the two (keeping the distinction in the trace-jit, of course).

On a more general performance note:

I am not able to reproduce the big browser gains reported in bug 571623 comment 2.  On most OSes with x86, SS does not report a significant change.  V8 in the browser shows enormous (100 points for scores of 1000) variation on all 3 OSes.  Running it multiple times, 32-bit looks mostly the same (all OSes), but 64-bit OS X shows about 4% improvement.

One interesting thing is that v8-splay is 30% slower on 32-bit (all OSes) and 0% change on x64.  All the time is spent in the 'setup' step where v8-splay builds a giant binary tree of objects with arrays as leaves.  Making a microbenchmark that simply allocates a bunch of objects shows a similar 30%/0% change on x86/x64.  This all points quite clearly to increased memory consumption resulting from sizeof(jsval) doubling as the cause.  Type-specific arrays (bug 566767) would presumably recover a good bit of v8-splay by using int32s in the arrays.  But, in general, this seems to be an essential cost of 64-bit values on 32-bit systems which, for JM, we have decided we want.

On the other side, some benchmarks benefit enormously: access-nbody is 2.56x faster, nsieve-bits is 45% faster, and v8-richards is 38% faster.  When we are leaning more heavily on the method-jit as our baseline (as opposed to using the trace-or-die heuristics we have now on TM), it seems like more of the tests that currently show no speedup would fall into this category.

Judging from shark profiles, there seems to be a 1-2% speedup still to be had on both 32-bit and 64-bit by removing js_UnboxDouble and some unfortunate uses of insAlloc, using mini-phi nodes (bug 575527) and LIR_qasd/LIR_dasq (bug 574969).  An indeterminate additional speedup also remains on 32-bit by not insAlloc'ing js::Value arguments to natives.  Nick has identified several cases where this is causing code not to be CSE'd and DCE'd where it was before.

So, enormongous patch with light speedup on TM, but I think it is justified by what it allows JM to do going forward and by its win on 64-bit.
(In reply to comment #57)
>
> One interesting thing is that v8-splay is 30% slower on 32-bit (all OSes) and
> 0% change on x64.  All the time is spent in the 'setup' step where v8-splay
> builds a giant binary tree of objects with arrays as leaves.  Making a
> microbenchmark that simply allocates a bunch of objects shows a similar 30%/0%
> change on x86/x64.  This all points quite clearly to increased memory
> consumption resulting from sizeof(jsval) doubling as the cause.

See comment http://code.google.com/p/v8/issues/detail?id=690#c2.  It claims that in the proper version of V8 (ie. *not* what you get if you run it in the SunSpider harness) the setup/teardown costs aren't counted.  But maybe that's incorrect.

> Type-specific
> arrays (bug 566767) would presumably recover a good bit of v8-splay by using
> int32s in the arrays.

That's assuming we end up specializing arrays-of-ints as arrays-of-ints, rather than arrays-of-doubles.  My results so far show arrays-of-ints causing slowdowns on the 3d benchmarks, for the usual reasons -- lots of values start off as integers but end up as doubles, so specializing for integers is bad.

> But, in general, this seems to be an essential cost of
> 64-bit values on 32-bit systems which, for JM, we have decided we want.

Yes, I still think this a good change even if the v8-splay slowdown cannot be avoided.
(In reply to comment #58)
> It claims that in the proper version of V8 (ie. *not* what you get if you run
> it in the SunSpider harness) the setup/teardown costs aren't counted.

Ah, interesting, thanks.  I was indeed looking at v8-splay in the SS harness in the shell for that.
(In reply to comment #59)
> (In reply to comment #58)
> > It claims that in the proper version of V8 (ie. *not* what you get if you run
> > it in the SunSpider harness) the setup/teardown costs aren't counted.
> 
> Ah, interesting, thanks.  I was indeed looking at v8-splay in the SS harness in
> the shell for that.

I also know that the proper version of V8 has a deterministic Math.random(), which affects v8-crypto.js (makes its runtime much less variable).
What?!  I thought every good benchmark needed a nondeterministic Math.random!
Rebase and other 'todo' items complete.  The only outstanding issue is that, while bug 571623 removed jsval as the return type of all specialized natives, bug 560643 adds jsval as a return type to XPIDL.  To get everything building/running I've turned traceable native generation off (again).  With that, however, the orange on the Cedar tinderbox matches TM tip and known random orange, so it looks like everything else is ready to go.
Can we back out the XPIDL change until we have a fix and land fatvals?
I recall Peter saying that there are already uses.  Jason said it was a popular request.
No longer depends on: 539144
So nevermind!  While jsval *is* in use in quick-stubbed fast natives, it is not being used for any *traceable* natives (which I only realized when I found a bug in qsgen.py that would have thwarted any attempts to do so (traceReturnTypeMap is keyed on 'jsval', not '[jsval]')).  So I can just rebase Peter and Dave's fine patch.

Assuming I can get another green cycle on Cedar after this, I think this means it's landing time.
> it is not being used for any *traceable* natives

We should probably file a bug on this (to be fixed after your merge), since the nontraceable quickstub natives kill us in tracer...
Depends on: 578547
Blocks: 578917
http://hg.mozilla.org/tracemonkey/rev/9c869e64ee26
Whiteboard: fixed-in-tracemonkey
Blocks: 578922
Blocks: 579100
Blocks: 579140
Bug 557423 comment 75 has some positive results from fatvals on WebGL Quake:

> Summary (by "before" I mean the results reported above in this bug report, 
> i.e. mozilla-central from a few weeks ago):
>
> * In-game performance is noticeably improved, and the quake2 benchmark now
>   reports 9.5 FPS, up from 8.9 FPS before.
>
> * What these numbers don't say, is that the benchmark *loading* speed is much
> improved. Runs an order of magnitude faster now.
>
> * From what I understand, profiler results (see below) seem to show that the
>   crazy GC'ing is indeed fixed now --- presumably thanks to fat values.
No longer depends on: 572298
Depends on: 579793
Depends on: 579937
Depends on: 578812
No longer depends on: 579937
No longer depends on: 581486
Depends on: 582523
Depends on: 583618
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Depends on: 584453
Alias: newjsvals → fatvals
(In reply to comment #66)
> > it is not being used for any *traceable* natives
> 
> We should probably file a bug on this (to be fixed after your merge), since the
> nontraceable quickstub natives kill us in tracer...

This may be a stupid question, but was a bug filed as requested above?
Yes, and fixed in bug 571623.
No longer depends on: 582523, 583618
 (In reply to comment #42)
> To fix this, we've talked about a modified 64-bit
> scheme that is close to what Shaver said in comment 34: 13 bits of nan, 4 bits
> of type, 47 bits of pointer (thereby making the much milder assumption (as JSC
> already does) that that userspace is running in the canonical lower half of the
> address space). 

From a cursory examination of the final patch, it appears to me that this is the final encoding used for 64-bit builds -- is this correct? If so, was any particular hackery necessary to ensure allocations landed in a 47-bit-safe address space, or do all currently-interesting 64-bit OS's do this by default?
No hackery is/was necessary for x86_64.  Hackery is in progress for Sparc64 (the bug # escapes me).  It's not terrible, though, since the only pointers stored in values (modulo static strings) are gc-things, so are allocated by page in a single place.

To wit: there is an experiment underway to go back to 128-bit fatvals on 64-bit systems to avoid the boxing/unboxing cost in 64-bit mjit code (I know, will we just pick a representation already?!  Fortunately, after the initial abstraction effort, it's much easier to change these things now.)  I don't think there is a bug for it yet.
Oh, for reference on why x86_64 didn't require any hackery:
http://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details
(In reply to comment #76)
> Oh, for reference on why x86_64 didn't require any hackery:
> http://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details

Ah, excellent reference, thanks; I knew about the high-bit-replication design of x86-64 but didn't know that the high half was effectively reserved by current OS's for kernel space -- good to know!
No longer depends on: 618574
Depends on: 624011
Depends on: 652060
Landed a followup to fix this build warning introduced by a patch on this bug:
> xpidl_header.c:642:15: warning: C++ style comments are not allowed in ISO C90
http://hg.mozilla.org/integration/mozilla-inbound/rev/1679a3cf1e2b
Depends on: 781058
Depends on: 584252
Depends on: 981371
Occasionally, I come here to find

https://blog.mozilla.org/rob-sayre/2010/08/02/mozillas-new-javascript-value-representation/

which has apparently been killed for some reason, but a copy appears to have been archived at

https://evilpie.github.io/sayrer-fatval-backup/cache.aspx.htm

(thanks, evilpie!)