Closed Bug 488775 Opened 12 years ago Closed 12 years ago

TM: widen LIR instructions

Categories

(Core :: JavaScript Engine, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(4 files, 6 obsolete files)

Each LIns is currently 4 bytes wide.  The compactness is admirable, but makes various things difficult:
- immediates don't fit in a normal LIns
- operands can only point up to 256 instructions backwards;  any more and they
  need a special trampoline instruction in the LIR
- probably some other annoyances

On Andreas' suggestion, I'm planning to try to widen each LIns to 4 words.  Then operands will be able to be normal pointers to previous instructions, and more immediates (and similarly large arguments) will fit in normal LIns instructions.  This should make the code simpler and hopefully also faster -- less fooling around with decompaction will hopefully overcome the increased memory usage.

This bug is for tracking this.  I haven't got very far yet, I'm still hunting down all the places where nanojit assumes an LIns is 4 bytes :(
Attached patch Patch to widen LIR (obsolete) — Splinter Review
This patch widens LIR.

- Each LIns is now 4 words, rather than 4 bytes.

- Operands are pointers to prior instructions, rather than (small) offsets
  to those instructions.  This means that trampolines (LIR_tramp and
  LIR_neartramp) and their related machinery are no longer necessary.

- LIR_nearskip is no longer necessary, because the size parameter can fit in
  a single LIns.

- LIR_short is no longer necessary, because LIR_int can cover both 16-bit
  and 32-bit immediates.

- The following instructions which were previously larger than a single LIns
  now fit in a single LIns: LirFarIns, LirImm32Ins, LirImm64Ins, LirCallIns.
  (But LIR_call args still are stored next to the LIns.)

- No code assumes that an LIns has a particular size or layout (previously
  various places did).

- I renamed a few methods to be consistent with other, similar methods.

This simplifies the code quite a bit;  there are 220 fewer lines of code,
and the removal of the trampolines and the instruction kinds larger than a
single LIns is a big win in terms of complexity and readability.

Performance seems to be marginally worse, a 1.009x slow-down on SunSpider.
Presumably the slow-down from the greater cache pressure is slightly larger
than any speed-up due to avoiding trampolines.  I'll attach the full perf
results shortly.

Some things that are yet to be fixed:

- Only implemented for x86 -- the ARM, Thumb and SPARC back-ends haven't
  been updated.

Some things that could be further improved:

- Could put the first two args of a call inside the LIns, rather than next
  to it.  This would reduce the size of many LIR_call instructions (those
  calling functions with 1, 2, 5 or 6 arguments).

- The LIns fields could be moved around; there's lots of leeway in terms of
  where you can put the padding.
Attachment #374021 - Flags: review?(graydon)
Attached file SunSpider results (obsolete) —
Attachment #374022 - Attachment mime type: application/octet-stream → text/plain
I would suggest making a lot of the simplified functions inline, especially oprndX() and friends. That will provide some speedup I would guess.
Attachment #374021 - Flags: review?(edwsmith)
My hero! I will review this at some point -- because wow, do I ever want such a patch -- but it's a bit off the critical path for my blocker list at the moment.

(In reply to comment #3)
> I would suggest making a lot of the simplified functions inline, especially
> oprndX() and friends. That will provide some speedup I would guess.

Oh come on, there's no need to guess, just profile it and attack the hotspots. It's easy to stress-test.
Turns out I wasn't passing -j to sunspider via the --args flag, so I was measuring the interpreter!  (Thanks to Sayre for pointing out that my times seemed slow which led to me realising this.)

JIT measurements are attached.  There's now a 1.008x speed-up, yay!  And I'll do some profiling and see if I can improve it further.  (The measurements are of a 32-bit build on a 64-bit Ubuntu Linux box, BTW.)

As for why I'm getting a 1.009x slow-down when just running the interpreter... the only explanation I can think of is that the changed nanojit code caused memory to be laid out in a slightly different way which unluckily caused worse cache behaviour.  I reran the tests and got similar results, so that slow-down is repeatable.
Attachment #374022 - Attachment is obsolete: true
Notes from discussion with nick. We can make the reservation table go away in a follow up patch since we have more room now. Benefits:

1) failed assembly doesn't leave LIR in a bad state
2) no more running out of reservation space
3) LIR tells us where values live (register/stack), which can be used to make branch code more efficient (read values straight from the registers instead of going via memory)
- Avoided initialising unnecessary fields in initOpcodeAndClear(), this cut about 12 ms off SunSpider.  According to Cachegrind this is because fewer cache write misses were happening.
- Reordered some LIns fields and made various small functions inline, which didn't make much difference but shouldn't hurt.
- Removed LIR_st and LIR_stq -- widening the 'disp' field in LIR_sti and LIR_stqi meant they were no longer necessary.

I also tried inlining the first argument of LIR_calls into the LIns (not two as suggested above, because two won't fit if we enlargen the resv as Andreas suggested).  It didn't help and made things more complex so that change isn't in this patch.

I'll attach the new SunSpider results shortly, overall it's now 1.017x faster.
Attachment #374021 - Attachment is obsolete: true
Attachment #374021 - Flags: review?(graydon)
Attachment #374021 - Flags: review?(edwsmith)
Attachment #374198 - Flags: review?(graydon)
Attachment #374198 - Flags: review?(edwsmith)
Attachment #374167 - Attachment is obsolete: true
Attachment #374198 - Flags: review?(edwsmith) → review+
Some time ago we also prototyped an 8-byte LIns implementation, and it wasn't too hard to support both formats with minimal ifdeffing and some factoring.  Worth considering here?

I'm a bit surprised its not more of a speedup, for the 3-4x increase in IR size.  Would it be possible to isolate compile time in the benchmarks?
With respect to the 8-byte LIns implementation:  I think it's not easy, nor a good idea.  8 bytes would mean that normal operands couldn't be simple pointers, but would have to go back to offsets, which means you'd have to reintroduce trampolines.  (Or you could keep the simple pointers as operands, but then you couldn't represent a simple 2-operand instruction in a normal-width LIns, which would be crazy.)  Furthermore, LIR_quad instructions would have to be double-width again, and LIR_sti and LIR_stqi would have to be reintroduced.  In other words, it would reintroduce about 3/4 of the removed complexity.  Also, Andreas's resv simplification/optimisation (comment #6) probably wouldn't be possible.

In comparison, I can see that supporting 4-byte and 8-byte LIns at the same time with #ifdefs would be fairly easy, as the basic approach would be the same -- eg. you'd use trampolines in both -- and in the 8-byte version you'd probably have minor improvements like 16-bit offset operands, 16-bit sti.disp fields, and the ability to represent LIR_int in a normal-width instruction.  But once you've moved to 4-words and removed most of the machinery, it's difficult and undesirable to put it back.

As for the speedup, I'm confused by your comment:  surely the 3x-4x size increase should slow things down, if anything, due to greater cache pressure?  It's the simpler traversals and operations (eg. no trampoline chasing, and simple operations like oprnd1() can be inlined) that speed things up.  Although Andreas suggested this change for speed, I was viewing it as a win more for the complexity reduction, and any speedup as a bonus;  1.017x struck me as pretty good :)

Nonetheless, I can measure compile-time if anyone can tell me how.  Graydon's standalone LIR assembler (#484142) could be used for that, but maybe there's an easier way?
(In reply to comment #9)

> Some time ago we also prototyped an 8-byte LIns implementation, and it wasn't
> too hard to support both formats with minimal ifdeffing and some factoring. 
> Worth considering here?

Sorry, I wasn't clear.  Would it be worth factoring the code so the nanojit embedder can select between 4-byte and 16-byte LIns implementations at compile time?  

regardling 8-byte LIns, we're in agreement, i wasn't proposing we re-introduce that format.  Since we have 4-byte LIns that is slow/small/complex, and 16-byte LIns is big/fast/simple, then 8-byte LIns would be slow/bigger/complex, not worth the trouble.

MD5 is the best one to look at because it's compile time is so high, and the results are more impressive:

md5:    1.21x as fast      17.5ms +/- 2.2%    14.5ms +/- 2.6%     significant
I wonder whether compression LIR with standard compression algorithms is better than a 4-byte LIR. We have a brief period of intensive access (emit and compile), and then we barely ever touch the LIR again.
With Ed having granted the review, is it necessary for Graydon to do one as well?  He's welcome to, of course, but it's not clear to me what the next step in moving the bug forward is now.
(In reply to comment #13)
> With Ed having granted the review, is it necessary for Graydon to do one as
> well?  He's welcome to, of course, but it's not clear to me what the next step
> in moving the bug forward is now.

Eh. Blurry rules. On the one hand, adobe owns the tamarin repos and mozilla owns the tracemonkey ones. On the other hand, the long term goal (which I swear, I *will* get back to once we get 3.5 out the door) is to absorb all the nanojit code from each into the other. So we tend to assign review duties to whoever seems to best "know" the code in question.

I don't know who tagged this as dual-review-requiring; I've got some time today so I'm happy to look it over now, and it's sufficiently invasive that it seems reasonable to need 2 reviews. But I'm not sure there's a formal rule for when a particular reviewer or quantity-of-reviewers is appropriate.
Some review notes:

  - It would probably have been better to separate the unrelated renamings
    from this patch. Eg. s/skip/insSkip/, s/const/imm32/, etc. I don't 
    dislike them, but they clutter this one. Separate patches for separate
    tasks.

  - in insAlloc, you're still asserting that the allocation is a u16. There's no
    need; it'll now accept u32 allocations (however dubious that itself may be).
    Should still assert that the allocation is non-negative though. Similarly
    in setSize, in LIR.h, can loosen the condition.

  - I've never read constvalq (now imm64) before, but ... wow. Does it need to 
    be this complex? Can't we just return

      (uint64_t(i64.imm64_1) << 32) | uint64_t(i64.imm64_0)

    ? Like, just invert what we did when we inserted an imm64? What is with 
    the endian-ifdefing and unions? And AVMPLUS_UNALIGNED_ACCESS? I'm 
    surprised at all this machinery.

  - Aha. It is a copy-and-paste job from constvalf / imm64f, I see. Makes
    more sense, but I still think imm64 ought to remain as simple as possible.

  - If you're going to do as much renaming as all this, perhaps insImmf()
    should become insImm64f() for symmetry?

  - Oh! InsCall is so much nicer..

  - Naturally -- due to all the combined renaming :) -- you still have 
    to adjust all the other NativeFoo backends before this can land.

Otherwise, as the critics put it, "two thumbs way up". This is much needed cleanup. Thanks.
Attachment #374198 - Flags: review?(graydon) → review+
Thanks for the review.  I initially put Graydon down as the reviewer, and then Ed added himself as a second reviewer.

Comments:

- Yes, constvalq is ugly.  AVMPLUS_UNALIGNED_ACCESS is never defined, so I was tempted to remove it but wasn't sure about the invisible dependency on Tamarin.  Perhaps Ed can comment?

- And I saw last night that using a union to do a cast, as constvalq() does, is explicit called out as bad style by Stroustrup;  it's used in several places in nanojit (eg. LIns::insImmf()).  I'd be happy to change constvalq() to your simple definition.

- As for adjusting the other back-ends, Sayre said I definitely had to do the ARM back-end, but the Sparc back-end is looked after by someone at Sun and the Thumb back-end is dead...?  Is that right?  (Nb: even if I hadn't renamed things, the other back-ends would still be affected.)
All non-trivial changes to nanojit should ask for Ed's additional review. Our goal is to keep the code bases in sync. It helps if both sides show each other their cards before we do the next big round of merges.

For the TM side, anyone can review patches who feels qualified. For nanojit its often graydon, danderson or I (probably also dmandelin).
(In reply to comment #16)
> Thanks for the review.  I initially put Graydon down as the reviewer, and then
> Ed added himself as a second reviewer.

yeah, i wasn't sure of ettiqute here, but i wanted to mark this one to come back to later, so i R?'d myself (maybe there's a better way to do this)

> - Yes, constvalq is ugly.  AVMPLUS_UNALIGNED_ACCESS is never defined, so I was
> tempted to remove it but wasn't sure about the invisible dependency on Tamarin.
>  Perhaps Ed can comment?

that switch is there for cpu's that will fault on non-naturally aligned loads/stores.

> - And I saw last night that using a union to do a cast, as constvalq() does, is
> explicit called out as bad style by Stroustrup;  it's used in several places in
> nanojit (eg. LIns::insImmf()).  I'd be happy to change constvalq() to your
> simple definition.

this might have been an attempt to do type-punning in a way that gcc -fstrict-aliasing "blesses".  i'm fine with a fix if gcc is happy.

> - As for adjusting the other back-ends, Sayre said I definitely had to do the
> ARM back-end, but the Sparc back-end is looked after by someone at Sun and the
> Thumb back-end is dead...?  Is that right?  (Nb: even if I hadn't renamed
> things, the other back-ends would still be affected.)

sparc: leon.sha@sun.com
thumb: dead (code already removed in tamarin/nanojit)
ppc[64], x64, i386: Rick Reitmaier (rreitmai@adobe.com) and I maintain these in tamarin, happy to consult if there are questions.
> > - Yes, constvalq is ugly.  AVMPLUS_UNALIGNED_ACCESS is never defined, so I was
> > tempted to remove it but wasn't sure about the invisible dependency on Tamarin.
> >  Perhaps Ed can comment?
>
> that switch is there for cpu's that will fault on non-naturally aligned
> loads/stores.

The weird thing is it's never defined anywhere in the TM repository.  Maybe it is in Tamarin?

w.r.t. the back-ends, we don't even have the ppc and x64 back-ends in TM, so I don't know what to do about them.  And I still don't know what to do about Sparc.
With respect to UNALIGNED_ACCESS, using a union, etc., I wasn't strictly objecting to the use of unions (they're illegal by the book, but work on every compiler). I just think it's a bit over-engineered. I believe the following two definitions are agnostic wrt. both endianness and the the willingness of the target system to tolerate unaligned loads:

uint64_t LIns::constvalq() const
{
  LirImm64Ins* l = (LirImm64Ins*)(this-LIR_IMM64_SLOTS+1);
  return (uint64_t(l->v[1]) << 32) | uint64_t(l->v[0])
}

double LIns::constvalf() const
{
  union { uint64_t dst; double tmpf; } u;
  u.dst = constvalq();
  return u.tmpf;
}

So I was suggesting simplifying down to something like that.
Blocks: 490947
Blocks: 478301
I tried using scratchbox to do the ARM changes, but failed -- I'm getting mysterious seg faults on 'js' even though smaller programs work fine.  Vlad suggested I ssh to a mozilla machine with scratchbox on it, but that requires my ssh key be registered in LDAP (#491423), which I'm currently waiting for.
Depends on: 491423
Rebased the patch.  Sayre said he would test this on the try server (I can't because I don't have commit privileges yet.)  And I'll try to get the ARM backend fixed on vlad's machine soon.
Attachment #374198 - Attachment is obsolete: true
Attached patch rebased again (obsolete) — Splinter Review
Hopefully this will apply cleanly.
Attachment #376124 - Attachment is obsolete: true
You should be able to use the web interface (though not push-to-try) with your mozilla.com LDAP account; see https://wiki.mozilla.org/Build:TryServer
This patch has the appropriate changes to the ARM back end.  They're extremely similar to the changes in the x86 backend so I don't think they need review, but if anyone disagrees let me know.

I ran the unit tests under ARM emulation on 'scrapper' (which only takes 2.5 hours per run).  Results were the same after the patch was applied, so the patch works.

Now we're just waiting on the results of sayre sending the previous patch to the try server.  If that looks good it's ready to be landed.
Attachment #376142 - Attachment is obsolete: true
Just out of interest, do you know what effect this has on the size of the LIR code? ARM processors tend to have smaller caches than x86, so whilst you get a speed-up on x86, you may end up damaging performance on ARM. I know that a side-effect is that you can remove several instructions from the LIR instruction set, so it may still be beneficial, but it would be worth benchmarking it on ARM to see what effect it really has.

> 3) LIR tells us where values live (register/stack), which can be used to make
> branch code more efficient (read values straight from the registers instead of
> going via memory)

Nice! This is more painful on ARM; it has a fairly large register set so reducing the number of needless memory accesses would certainly be beneficial.

---

> [...] and the Thumb back-end is dead...?  [...]

Yes, the Thumb back-end is dead. I submitted a patch to remove it in bug 486535.
Jacob:  no idea about the LIR code size.  I don't know how to measure that.

As for benchmarking ARM:  I only have access to a machine running an ARM emulator (scratchbox) so I can't do ARM benchmarking.

It's possible that this change could hurt ARM performance, but it is an enabler for future changes such as the resv table improvement that you said yourself could avoid memory accesses.  It might also be possible to make the LIR smaller again -- most of the opcodes only use 2 or 3 words, so making each LIR instruction only as wide as necessary could gain back part of any loss.  (That might sound like it would be going back to what we had pre-patch, but it's not, because we're still variable-width now due to 'call' and 'skip' instructions.)
Whiteboard: needs-checkin
Keywords: checkin-needed
Whiteboard: needs-checkin
http://hg.mozilla.org/mozilla-central/rev/05ea1afb04f8
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Oh my. Landed pre-3.5? I thought the conversations about this converged on "save it for post-3.5"...

Oh well. It *is* a great set of improvements. Here's hoping it sticks! :)
It only landed on tracemonkey and trunk though, right? Nobody's pushing to land this on the 1.9.1 branch *for* 3.5.
3.5 over my dead body. This change the balance of LIR in the code cache, and might subtly change the amount of code we can have in the cache at once. I want to follow up with a patch to purge the LIR part of the code cache for fennec.
(In reply to comment #31)
> It only landed on tracemonkey and trunk though, right? Nobody's pushing to land
> this on the 1.9.1 branch *for* 3.5.

Correct.
Attached patch Patch for sparcSplinter Review
Attachment #376676 - Flags: review?(gal)
Leon, I think these two lines aren't necessary (ie. p is dead):

                 const int32_t* p = (const int32_t*) (value-2);

                 const int32_t* p = (const int32_t*) (ins-2);
(In reply to comment #31)
> It only landed on tracemonkey and trunk though, right? Nobody's pushing to land
> this on the 1.9.1 branch *for* 3.5.

Ah phew, misread. Thanks, glad to hear!
Attachment #376676 - Flags: review?(gal) → review+
Comment on attachment 376676 [details] [diff] [review]
Patch for sparc

Thanks!
FWIW, I was looking back at how TM performance on SunSpider has changed over a long time because of bug 515871.  And I stumbled across this patch again when measuring some regressions for sha1.  Turns out this patch undid a big chunk of a previous sha1 regression by speeding it up by 1.111x.  

In fact, my measurements done today say it caused a 1.033x speedup on my MacBook Pro, which is pretty hefty for a single patch.  Results are attached for anyone who's interested.
You need to log in before you can comment on or make changes to this bug.