Closed Bug 492866 Opened 16 years ago Closed 15 years ago

nanojit: variable-width LIR

Categories

(Core :: JavaScript Engine, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(3 files, 5 obsolete files)

Bug #488775 widened LIR from 1 word to 4 words.  This simplified things greatly (no more trampolines, for example) and gave slight speedups, at least one some machines.  But the new representation is wasteful of space, because many LIR instructions could fit in 1, 2 or 3 words.

I've started work on a patch to implement variable-width LIR, which avoids this wasted space.  I suspect it will reduce the average size of each LIR instruction to roughly 2.5 words.
Depends on: 492292
Depends on: 492488
This patch implements variable-sized LIR as described above.


I did some size measurements, of three versions:

- original-LIR: just prior to widen-LIR  (TM r28151)
- wide-LIR:     just prior to this patch (TM r28184)
- variable-LIR: this patch               (TM r28184 + patch)

The amount of bytes allocated for code (at the instruction level, not the
page level) is as follows:

- original-LIR:  7,559,784 bytes
- wide-LIR:     16,828,544 bytes  (2.20x more than original-LIR)
- variable-LIR: 12,210,008 bytes  (1.59x more than original-LIR)

Comparisons of the first against the latter two aren't perfect, as 33
revisions have occurred so there might be some minor tracing differences,
but it should be pretty close.  variable-LIR is 0.73x the size of wide-LIR.
The per-instruction average with variable-LIR is 15.5 bytes, which is a 
bit more than the 10 bytes (2.5 words) I estimated, it is pushed up by skip
and call instructions.


I also did perf measurements:

- For SunSpider, differences were negligible.

- For V8, crypto is much faster -- it went from 750--800 to 1400--1500;  this
  occurs both when running all of V8 and when running crypto alone.  The
  other V8 benchmarks had no discernible change.


Notes to the reviewer:

- I recommend reading the changes to LIR.h first.

- I was sorely tempted to put the methods in class LIns into a sensible
  order, but didn't in order to minimize the size of the diff.

- All LIns get/set functions now check via assertions that the opcode is
  appropriate for the set/get field.  Previously only some did.

- Some set methods weren't needed, so have been removed.

- ensureRoom() and commit() were merged and renamed makeRoom();  having them
  separate was no longer necessary, and a potential source of problems if
  the sizes given to them didn't match.

- Previously lots of the sizes in LIR.cpp were in terms of sizeof(LIns);
  now they are all in terms of bytes.

- insSkipWithoutBuffer() was inlined into the two places where it was
  called;  this made those two places much easier to understand.

- LIR instruction counting was moved into makeRoom() because it could be.

- There are some seemingly dead opcodes that are explicitly unhandled in
  LirReader::read(), and a comment about it.  I'm not entirely sure what to
  do about them, but I didn't want to write code to handle them that I
  couldn't test.  Bug #493125 has more details about this.
Attachment #377635 - Flags: review?(edwsmith)
The measurements in comment #1 were on LInux.  I tested V8 on Mac also, the crypto improvement occurs there too, going from about 900 to about 1550.  Both Linux and Mac measurements were done after Graydon's memory pressure measurement fix (#492673) landed.
We are probably hitting the max cache size.
Do you mean we're no longer hitting the max cache size due to the LIR width reduction?
Yeah, looks like we were threshing the code cache in crypto. Thats the only possible explanation.
But we went from small LIR (original-LIR) to wide (wide-LIR) to intermediate (variable-LIR).  And it was only with the last step that the increase occurred -- perf was always around 800 for crypto before variable-LIR, right?
(In reply to comment #6)
> But we went from small LIR (original-LIR) to wide (wide-LIR) to intermediate
> (variable-LIR).  And it was only with the last step that the increase occurred
> -- perf was always around 800 for crypto before variable-LIR, right?

Yeah, I think that's what happened.
Assignee: general → nnethercote
We'd need to check all three LIR widths to understand this correctly. It looks to me like the initial narrow LIR might have been checked without Graydon's patch.
Attached patch rebased patch (obsolete) — Splinter Review
Rebased patch.
Attachment #377635 - Attachment is obsolete: true
Attachment #378238 - Flags: review?(edwsmith)
Attachment #377635 - Flags: review?(edwsmith)
(In reply to comment #8)
> We'd need to check all three LIR widths to understand this correctly. It looks
> to me like the initial narrow LIR might have been checked without Graydon's
> patch.

I just checked four versions with and without Graydon's patch (which was 
TM r28184):

                                                  without patch  with patch  
- original-LIR (TM r28151)                        495            750
- wide-LIR     (TM r28152)                        250            780
- wide-LIR-later (TM r28417)                      280            1480
- variable-LIR (TM r28417 + rebased patch above)  280            1480

So Graydon's patch makes a big difference in all cases.  And it looks like something else happened between 28152 and 28184 that caused the 780 --> 1480 improvement -- the move to variable-LIR isn't responsible for that.
Depends on: 494084
In light of the fun Graydon and I had today with bug 494084, I'd like to let the wide-LIR representation marinate a while longer before switching to variable-width LIR.  In the meantime, there are several improvements that can be made from the above patch that can be applied to wide-LIR.  So I'm closing this bug and will open several smaller ones for them.  Variable-LIR can happen after they're in.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → WONTFIX
If we're planning to do variable-width LIR at some point, this isn't WONTFIX.  It might depend on the other bugs you mention, though.
Assignee: nnethercote → general
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Attachment #378238 - Flags: review?(edwsmith)
Depends on: 494639
Blocks: 495158
Attached patch updated patch (obsolete) — Splinter Review
Patch implementing variable-width LIR.  Each opcode is categorised according
to it's "representation kind", and this representation kind is now part of
LIRopcode.tbl.  (Since I was modifying almost every line in LIRopcode.tbl, I
took the opportunity to add comments to all opcodes.)

I recommend reading LIR.h before LIR.cpp, it'll make a lot more sense that
way.
Assignee: general → nnethercote
Attachment #378238 - Attachment is obsolete: true
Attachment #383862 - Flags: review?(graydon)
Attachment #383862 - Flags: review?(edwsmith)
The unused opcodes should be all 0-ary. Also, maybe remove imm8b() and make it argc(). Its only used by param and call, so just give it whatever names those use. imm8b is really cryptic.
(In reply to comment #14)
> The unused opcodes should be all 0-ary. Also, maybe remove imm8b() and make it
> argc(). Its only used by param and call, so just give it whatever names those
> use. imm8b is really cryptic.

arg() is already used for getting call arguments.  We could overload it (the existing one takes an integer argument) but I think that's a bad idea.  How about paramArg() and paramKind()?
I used paramKind() and paramArg().  And -1 for the arity of unused opcodes -- the effect is the same as 0 (since all the users of the arity only look for 1 or 2) but it's more obvious that it needs changing if an opcode becomes used.
Attachment #383862 - Attachment is obsolete: true
Attachment #383880 - Flags: review?(graydon)
Attachment #383880 - Flags: review?(edwsmith)
Attachment #383862 - Flags: review?(graydon)
Attachment #383862 - Flags: review?(edwsmith)
graydon, edwsmith, any news on reviews here?
Patch looks great. Probably needs some very mild rebase, but could happen post review. lastWord.opcode wants to be an inline opcode() function I think.
(In reply to comment #18)
> lastWord.opcode wants to be an inline opcode() function I think.

It is, and it's used outside class LIns, but not within.  For get functions like that I'm always in two minds whether to use them inside the class or not... but I'm happy to do so here.
Either way is fine. opcode() seemed clearer to me even inside the class, and future renamings firstWord->lastWord will only require patching one line :)
(In reply to comment #17)
> graydon, edwsmith, any news on reviews here?

Apologies, it's been on my list all week, keeps getting bumped. Reviewing now.
Comment on attachment 383880 [details] [diff] [review]
patch updated to address Andreas' comments

Overall, delightful clarification and cleanup. I particularly like the long
ASCII-art comment, and movement towards keeping more of the pointer
arithmetic class-private. No substantive objections.

Three style nits and one question.

Nit 1: There are a variety of vertically-aligned initializations made here,
I think that's non-house style.

Nit 2: There are a number of cases of saying "ins ->foo()" where you've
left whitespace before the ->. Those should go.

Nit 3: There are a bunch of assertions about type sizes that can be made
into static assertions using nano_static_assert(). Though you're just
following style used elsewhere, I see, so maybe we can just do a secondary
pass that converts all possibly-static asserts in one go. Whatever you
prefer.

Question: I don't know what you're basing the assumption of always landing
on a multiple of word-size on. You've got explicit 8 bit and 32 bit members
inside the various LIR kinds. Are you just assuming that the void* at the
end will force most sane compilers to align to a word multiple? I imagine
that's true but I'm not sure it's guaranteed anywhere. I'm also curious
what would happen if you relaxed that requirement and asked the compiler to
pack the structures. It'd probably slow down, but maybe the cache win would
compensate? Or is the word-multiple nature an important invariant?

Anyway, aside from that, if it passes the try servers (which, I believe,
will include trace-tests on all platforms as of this afternoon, whee!) I'm
happy with it.
Attachment #383880 - Flags: review?(graydon) → review+
(In reply to comment #22)

> Nit 3: There are a bunch of assertions about type sizes that can be made
> into static assertions using nano_static_assert(). Though you're just
> following style used elsewhere, I see, so maybe we can just do a secondary
> pass that converts all possibly-static asserts in one go. Whatever you
> prefer.

I wasn't aware of NanoStaticAssert;  that's much better as the assertions can be placed right next to the class definitions.  I'll change it.

> Question: I don't know what you're basing the assumption of always landing
> on a multiple of word-size on. You've got explicit 8 bit and 32 bit members
> inside the various LIR kinds. Are you just assuming that the void* at the
> end will force most sane compilers to align to a word multiple? I imagine
> that's true but I'm not sure it's guaranteed anywhere. I'm also curious
> what would happen if you relaxed that requirement and asked the compiler to
> pack the structures. It'd probably slow down, but maybe the cache win would
> compensate? Or is the word-multiple nature an important invariant?

Hmm, good questions.

I'm assuming/requiring that all LInsXYZ types are a multiple of word size and are word-aligned when allocated.  On the size issue, I thought that the C/C++ standard required the padding/alignment but now I see that it's implementation-defined.  But it seems all normal compilers do the obvious thing (align according to size), which is what I want.  And the assertions in LirBufWriter() check this.  In an earlier version of the wide-LIR code I had explicit padding, it's a bit tedious and verbose because you have to jump through some hoops to handle 32-bit and 64-bit machines.  I could add it back in if you think it's important.

As for whether LInsXYZ objects themselves will be word-aligned... it partly relies on pages themselves being word-aligned.  Pages are ultimately allocated by GCHeap::Alloc().  On Unix it uses mmap, which is guaranteed to be word-aligned.  On Windows it uses VirtualAlloc(), which I don't know about but would assume is word-aligned (it'd be weird if it wasn't).  On other systems it uses valloc() which is guaranteed to be word-aligned.  I'll add an assertion to makeRoom() which checks that the space allocated for each instruction is word-aligned.  I'll also augment the big comment to explain that instructions are word-aligned.

Beyond that, makeRoom() checks that the asked-for size is a multiple of word-size, so if the page is word-aligned then all the instructions within will be.  (insSkip() also rounds up payload sizes, but it goes through makeRoom() eventually so that is checked anyway.)

I believe that all of the above ensures that instructions are multiples of word size and word-aligned.  As for whether that's necessary... probably not, but I think trying to pack it tighter is a bad idea.

- On 32-bit machines, only LInsC and LInsP would be smaller.  I'm not seeing any noticeable speed-up with the current patch, and it reduces LIR size by something like 25%.  I think this would make a much smaller difference (1%?) and there's the risk of possible slowdowns due to unaligned values.

- I'm not sure how portable packing pragmas are.  I know GCC has attribute "packed" but I don't know about other compilers.

- Do any of the architectures we run on require alignment of any values?  I have a vague idea that this is so (is that why we store 64-bit immediates in two 32-bit chunks?), but I'm not sure.

Thanks for the comments!


Re the try servers:  do they run a debug build?  I certainly hope so, I want assertions to be on...
(In reply to comment #22)

> Nit 1: There are a variety of vertically-aligned initializations made here,
> I think that's non-house style.

Does Nanojit even have a house style?  I can unalign, but I think it's a shame, as I personally find this:

        LInsOp2* insOp2 = (LInsOp2*)_buf->makeRoom(sizeof(LInsOp2));
        LIns*    ins    = (LIns*)&insOp2->ins;

much easier to read than this:

        LInsOp2* insOp2 = (LInsOp2*_buf->makeRoom(sizeof(LInsOp2));
        LIns* ins = (LIns*)&insOp2->ins;

And what about the toLInsXYZ() functions and the assertions in LirBufWriter()?  Vertical alignment helps even more in those cases, enough that IMO it becomes a correctness issue...


> Nit 2: There are a number of cases of saying "ins ->foo()" where you've
> left whitespace before the ->. Those should go.

Ok, I admit that is a bit strange and doesn't help a lot.  I won't grumble about removing those ones :)
> as I personally find this:
> 
>         LInsOp2* insOp2 = (LInsOp2*)_buf->makeRoom(sizeof(LInsOp2));
>         LIns*    ins    = (LIns*)&insOp2->ins;
> 
> much easier to read than this:
> 
>         LInsOp2* insOp2 = (LInsOp2*_buf->makeRoom(sizeof(LInsOp2));
>         LIns* ins = (LIns*)&insOp2->ins;

+1 for allowing aligned declarations when appropriate.  I find it makes
code a bit easier to read and verify.
> (In reply to comment #22)
> 
> > Nit 1: There are a variety of vertically-aligned initializations made here,
> > I think that's non-house style.
> 
> Does Nanojit even have a house style? 

Come on now, that's below the belt. :)

(In reply to comment #25)
> 
> +1 for allowing aligned declarations when appropriate.  I find it makes
> code a bit easier to read and verify.

A variable-width LIR bug is not the place to start a style debate. Though I can see the appeal, I can't recall any aligned declarations in Mozilla code.(In reply to comment #24)
We align member declarators to start in the same column in structs and classes.

The same considerations can apply for locals, especially with favored C++ "declare at latest dominating initialization" style. Historically SpiderMonkey was C (I realize Nanojit was always C++) so we declared early -- but we didn't align declarators because we didn't initialize, since it is harder to read a forest of initialized C declarations all hoisted and bunched at the top of a function.

I say give Nick a little freedom on this one, esp. since Julian +1'd. I doubt Ed would demur.

/be
(In reply to comment #27)

> I say give Nick a little freedom on this one, esp. since Julian +1'd. I doubt
> Ed would demur.

*Shrug* I don't care much; my own code tends towards vertical alignment too, and I agree it's easier to read. I was just making note to fit in with the surroundings.

We can have a real Style Discussion some other day. Like when we fix all the whitespace :)

Re: alignment/packing issues, satisfactory answers. With the static asserts and a pass of the try servers, we're working everywhere that matters and checking for cases where we might fail, so it's fine.

I'm not sure if any of the try servers do debug builds. Static asserts, however, are IIRC turned-on in all cases. Being static dead-code and all.

If you've not done so yet, now's a good time to learn to use "push-to-try" to submit an MQ-stored patch to the try service. A single command from your workspace! The technique is documented in http://blog.mozilla.com/jorendorff/2008/08/18/push-to-try/ and https://wiki.mozilla.org/Build:TryServer, and they even have the civility to email you notices of success or failure.
Attached patch updated patch (obsolete) — Splinter Review
Updated patch.  Changes:
- It needed a sizeable rebase because of 500053 which introduced set*()
  functions for class LIns.  The interaction between LIns, LInsXYZ and
  LirBufWriter changed significantly, and the set*() functions became
  init*() functions and also changed substantially.  But it's basically just
  plumbing changes, albeit a lot of them, and results in fewer friend
  declarations.

- Updated comments about word-sizing/word-alignment.

- Used NanoStaticAssert where applicable.

- Used opcode() where possible instead of lastWord.opcode.

- Got rid of spaces before "->".

It passes everything on the try servers, and trace-test.js passes on my ARM
VM.

Performance-wise, I get a 6ms slowdown on Linux and a 4ms slowdown on Mac.
(I'll post the Sunspider results shortly.)  It's unclear why.  I did some
digging with Cachegrind, it reports that the number of D2 cache write misses
dropped by about 5%, and other figures (including instruction counts) were
barely any different.

(Nb: Ed, if you want to make any comments please do so soon!)
Attachment #383880 - Attachment is obsolete: true
Attachment #383880 - Flags: review?(edwsmith)
Blocks: 501232
I updated to TM r29661 and reran the tests.

- Mac, --runs=100, attempt 1:  1ms slowdown
- Mac, --runs=100, attempt 2:  25ms slowdown (!)
- Mac, --runs=100, attempt 3:  1ms slowdown

- Linux, --runs=1000, attempt 1: 1.7ms slowdown

I don't think my patch changed at all since the tests I did in comment 30 and comment 31.

Given these results, and the nature of the patch (less wasted memory should only help performance in the long run) I'm prepared to declare victory and land it.  Objections?
Let's look a little longer. We officially have a "no performance regressions" policy here. It's true that we might be measuring wrong, but we're all measuring a slowdown, minimal though it may be. Any word on the cachegrind branch-miss numbers?
Hmm, here's a plausible explanation.  Cachegrind says that the switch in LirReader::read() causes 132,168 indirect branch mispredictions, which is an increase of 4%.  Both versions involve a switch, but in the old code, in most cases the action was the same, because all instructions were the same width, so if the branch predictor was always predicting the common case it would do ok.
In the new code, the switch is much less predictable.

I'll try changing it to use some kind of table, hopefully that'll fix it.

Thanks to Graydon for suggesting the branch-prediction profiling!
This patch uses a mostly table-driven approach in read() -- the size of each LIR instruction is stored in a 128-byte table which is accessed via sizeof_LInsXYZ().  read() uses sizeof_LInsXYZ() for all the cases except the weird ones (start, skip, call).

To do this I had to change the names of the enum LInsRepKind members slightly, and also how the instruction sizes are repesented in LIRopcode.tbl.

I also fixed staticSanityCheck():  it was using NanoAssert rather than NanoStaticAssert and thus the assertions weren't being run.

On Mac, I consistently get a 2ms speedup now.  On Linux I get no change.  I also used Cachegrind to confirm that the branch mispredictions in read() are gone.  (If anyone has access to a small device and wants to try it I'd be interested to know how the performance compares.)
Attachment #385695 - Attachment is obsolete: true
Attachment #385951 - Flags: review?(graydon)
(In reply to comment #35)
> 
> This patch uses a mostly table-driven approach in read() -- the size of each
> LIR instruction is stored in a 128-byte table which is accessed via
> sizeof_LInsXYZ().  read() uses sizeof_LInsXYZ() for all the cases except the
> weird ones (start, skip, call).

Oh, sizeof_LInsXYZ() got removed;  now there's just the table insSizes[].
Attachment #385951 - Flags: review?(graydon) → review+
Comment on attachment 385951 [details] [diff] [review]
patch avoiding branch mispredictions in read()

Looks good, thanks for finding the bottleneck.
Backed-out because the tree was closed for Talos maintenance :(

http://hg.mozilla.org/tracemonkey/rev/7d1066ec75a1
Whiteboard: fixed-in-tracemonkey → checkin-needed
Re-pushed.

http://hg.mozilla.org/tracemonkey/rev/d6f07972cdc1
Whiteboard: checkin-needed → fixed-in-tracemonkey
http://hg.mozilla.org/mozilla-central/rev/ae16e5919d19
Status: REOPENED → RESOLVED
Closed: 16 years ago15 years ago
Resolution: --- → FIXED
on style, yeah, defer debate, and since there's no strict rule (yet), readability wins, and at least local consistency is important.  (no need to go indenting everything all at once).  i wouldn't rip out init alignment just to be conformist.

similar for converting asserts to static asserts - if you're touching it anyway, make it as right as you can (static assert).
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: