586297 - YARR: Get the ARM branch patcher to handle 'B'-style branches.

Reporter

Description

•

14 years ago

I just disabled the ability of fixUpOffsets to promote "LDR pc, =addr" to "B addr" because the branch patcher doesn't know what to do with a branch instruction: http://hg.mozilla.org/projects/jaegermonkey/rev/13495e9f957e

This ability should be re-instated, and we should also allow it to be more aggressive because in JM we have a lot of short-range branches between fast and slow paths. (In JSC, this probably wasn't the case.) I would have already done that, but it's non-trivial. Once a branch is made into a 'B', it is hard to find its literal pool entry again in case we need to promote it back to an LDR.

Jacob Bramley [:jbramley]

Reporter

Comment 1

•

13 years ago

Hmm. The trivial scheme of encoding an offset to a pool entry in a NOP after every branch seems to cost about 5% of performance on Kraken, with "-m". I didn't measure any difference on V8, however. In any case, a more sophisticated scheme of logging the pool entries may be more successful.

Jacob Bramley [:jbramley]

Reporter

Comment 2

•

13 years ago

Attached patch Use B branches in the patcher. (obsolete) — Details — Splinter Review

The attached patch is very crude and definitely not complete, but I have
some preliminary results so I'm attaching it as a work-in-progress.
Here's how it works:

When an LDR is optimized to be a B, its literal pool slot becomes and
element in a linked list. The linked list is restricted to the bounds of
the literal pool. In addition, the initial element of every literal pool
is the first element in one of these lists, and is an invalid
instruction encoding so we can find the pools.

In addition to be part of a linked list, each unused pool slot also
contains an offset to the LDR or B to which it originally belonged to.
The linked list offsets and LDR offsets are all packed into the 32-bit
pool slot: 11 bits for each linked list offset and 10 bits for the LDR
offset. The LDR offset doesn't need a sign bit as it's always negative.
10 bits (or 11 bits signed) is always enough; an LDR has a 12-bit
offset, but ours are always 4-byte aligned.

Here's how LDR branches are optimized to relative B branches:
   * Calculate the required branch offset (since B is a relative branch)
     and write the B instruction.
   * Synchronize the I and D caches, since we modified the instruction
     stream.
   * Scan forward from the LDR/B to the next literal pool. Literal pools
     in JM are always _after_ the associated load, so this will never
     fail. Also, because the first element in the pool is known not to
     be a valid instruction, we'll never get false positives etc.
   * The first element in the pool also forms an element in the linked
     list of pool slots, so we can use this to insert the newly-freed
     pool slot into the list.

Here's how B branches are de-optimized to long-range LDR branches:
   * Scan forward to find the literal pool, as before.
   * Walk through the linked list of unused slots until a matching slot
     is found.
   * Remove the slot from the list and put the new branch target into
     it.
   * Patch the B instruction to the relevant LDR instruction.
   * Synchronize the I and D caches, since we modified the instruction
     stream.

The patching process is quite slow for patches that swap between B and
LDR, but reasonably fast for B-to-B patches and LDR-to-LDR patches.
B-to-B patching requires a cache operation, but LDR-to-LDR behaves
exactly as it does now.

With '-m' alone, I measured a 4.4% performance improvement on Kraken.
I think the overhead of scanning for the pool is about 1.5% overall.
However, I think there's more potential here, and I'd like to try some
further improvements:

   * I currently only optimize branches when they are patched. Branches
     that never get patched (after code finalization) don't benefit from
     this.

   * I accidentally tested the patch with those ((noinline)) attributes
     in place. They're there to aid debugging. I might get marginally
     better performance without them.

   * It should be possible to cache the literal pool base at least, to
     remove the need to scan for the literal pool on each patch. This is
     not quite as simple as it sounds, however, as checking the validity
     of the cache is tricky. (I have a couple of ideas.)

   * We also store patchable literals in the literal pools. It would be
     trivial to make some of these patchable using the same method as
     the branch patching. (ARM can do some limited immediate-loads in a
     single instruction.)

   * If we know that a branch will be short-range when we generate it,
     we could have a method in the back-end to ignore the literal pool
     for these branches altogether. This would save space in the
     literal pools and avoid the code-scanning overheads.

   * If we can give a hint to the back-end that a branch will never need
     to be patched again, we can skip all the scanning and messing
     about, since we will never need to find the literal pool slot
     again. I don't know how often this would be useful in practice.

   * When patching a large batch of branches, it would probably be wise
     to queue up several cache synchronization calls and issue them all
     at once. Each synchronization requires a call into the kernel, so
     they aren't especially quick.

I did look at patching our function call sequences (which do "LDR r8,
=addr; BLX r8"), but the calls never seem to be within range of a simple
BL or BLX. Our function calls are usually to C code, which gets
allocated from a different address base. This might change if with bug
615487.

Whilst it might be more time-efficient to store pointers to the literal
pool slots in the IC structures (for example), this implementation is
more space-efficient, and because the painful patches are relatively
rare I though the space-efficiency would be more important. Also, this
solution is entirely contained within Nitro and thus will benefit
IonMonkey as well as JaegerMonkey, with little (if any) extra effort.

Jacob Bramley [:jbramley]

Reporter

Comment 3

•

13 years ago

I've got this up to a 6.5% performance boost on Kraken! I'll post a patch and more details tomorrow. I've done the following of my suggested improvements:

 [X] Include branches that are marked unpatchable.
 [X] Turn off ((noinline)).
 [X] Cache the last literal pool search results.
 [X] Also optimize non-branch loads.

I've not done the following:
 [ ] Allow method-JIT code to hint that a branch will never need repatching.
 [ ] Queue up cache synchronizations and issue them in batches.

The remaining tasks are promising, but fiddly. They might be better added as separate patches.

Marty Rosenberg [:mjrosenb]

Comment 4

•

13 years ago

> Whilst it might be more time-efficient to store pointers to the literal
> pool slots in the IC structures (for example), this implementation is
> more space-efficient, and because the painful patches are relatively
> rare I though the space-efficiency would be more important. Also, this
> solution is entirely contained within Nitro and thus will benefit
> IonMonkey as well as JaegerMonkey, with little (if any) extra effort.


Do you mean that the IC structures will be less space efficent?
I think we can get away with not increasing the size of the IC structure.
Rather than have one pointer that points at the instruction, and one pointer that points at the slot in the pool.  We will always be able to get one from the other.  If the instruction is currently a LDR, in the IC store a pointer to the instruction, and can recover the slot by the offset stored in the instruction.  If the instruction is a branch, then we want to store a pointer to the slot in the pool (with 0x1 or'ed in).  We can recover the address of the branch by having a pointer to it in the slot (or enough bits to recover it).

Also, we may want to implement a fix to bug 675342 in the patching code.  Since the "branch" instruction doesn't actually need to be a branch, we can simply output a nop rather than a branch to the next instruction.

Jacob Bramley [:jbramley]

Reporter

Comment 5

•

13 years ago

(In reply to comment #4)
> > Whilst it might be more time-efficient to store pointers to the literal
> > pool slots in the IC structures (for example), this implementation is
> > more space-efficient, and because the painful patches are relatively
> > rare I though the space-efficiency would be more important. Also, this
> > solution is entirely contained within Nitro and thus will benefit
> > IonMonkey as well as JaegerMonkey, with little (if any) extra effort.
> 
> 
> Do you mean that the IC structures will be less space efficent?
> I think we can get away with not increasing the size of the IC structure.
> [...]

Yep, that could work. My current patch doesn't store anything in the IC struction at all, but does everything within the back-end. (This makes it applicable to IonMonkey for free.) It could easily be adapted, however.

> Also, we may want to implement a fix to bug 675342 in the patching code. 
> Since the "branch" instruction doesn't actually need to be a branch, we can
> simply output a nop rather than a branch to the next instruction.

Yep, that would be trivial to add to my patch so I'll give it a shot now. I didn't realize there were so many pointless branches!

Jacob Bramley [:jbramley]

Reporter

Comment 6

•

13 years ago

(In reply to comment #4)
> Rather than have one pointer that points at the instruction, and one pointer
> that points at the slot in the pool.  We will always be able to get one from
> the other.

Bear in mind that many pointers in ICs are actually offsets from a base pointer. These offsets are usually single bytes, and are hard-coded in many cases to save memory. Re-using the branch label offset to point to the literal pool is possible, but maybe not as easy as it sounds and probably not free of memory overheads.

Also, this would need to be done to every IC individually because they don't share a lot of code. It would be a big patch, _but_ it might win another 1.5% or so of performance, so it's worth doing. I'd propose investigating it as a separate bug, however. This one covers enough already.

Jacob Bramley [:jbramley]

Reporter

Comment 7

•

13 years ago

Latest news:

TEST                         COMPARISON    
                                           
===========================================
                                           
** TOTAL **:                 1.186x as fast
                                           
===========================================
                                           
  ai:                        -             
    astar:                   -             
                                           
  audio:                     1.186x as fast
    beat-detection:          1.198x as fast
    dft:                     1.149x as fast
    fft:                     1.25x as fast 
    oscillator:              1.162x as fast
                                           
  imaging:                   1.25x as fast 
    gaussian-blur:           1.27x as fast 
    darkroom:                1.32x as fast 
    desaturate:              1.162x as fast
                                           
  json:                      ??            
    parse-financial:         -             
    stringify-tinderbox:     ??            
                                           
  stanford:                  1.052x as fast
    crypto-aes:              -             
    crypto-ccm:              -             
    crypto-pbkdf2:           1.077x as fast
    crypto-sha256-iterative: -             

Some tests are much faster, others are about the same. I ran with "-m" in both cases, to factor out trace-jit and profiling behaviour.

Patch coming shortly...

Jacob Bramley [:jbramley]

Reporter

Comment 8

•

13 years ago

Attached patch Optimize literal pool loads after code finalization. (obsolete) — Details — Splinter Review

This version covers the features I mentioned previously, and also the redundant branches as discussed in bug 675342. The performance gain is surprising. I've seen 6-8% consistently but I saw much more than that after incorporating Marty's suggestion from bug 675342 (as in my last comment).

The patch is still full of debug code and needs cleaning up, which is why I haven't asked for a review.

Attachment #546151 - Attachment is obsolete: true

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 9

•

13 years ago

Marty asked how this will port over to IonMonkey (I think I've got the right bug), and the problem sounds strikingly familiar to what happened on x64.

For JM we had ran into a problem where the 64-bit address range could overflow the 32-bit offset in jumps. As a quick fix we inserted checks everywhere to disable ICs when this happened.

The story in IonMonkey (which does not yet have ARM support) is a little different. Code is GC'd, so we need to be able to trace its jumps to ICs and other scripts. We achieve this by appending a compressed relocation table to each code object.

On x86, the relocation table is just a list of offsets to all external jump instructions. To trace JIT code, we sniff the target of each jump.

On x64, for each external jump, we reserve 16 bytes in a pool at the end of the code. If we ever need to patch the jump, and the target does not fit in a 32-bit offset, we redirect the jump into that pool. The relocation table has both the position of the small and large jump, and the trace hook figures out which one is active.

Tentatively I was planning for ICs to store their jump offsets as indexes into the relocation buffer (these indexes will be small), which would also give the patching mechanism all the information it needs.

I have no idea whether any of this could apply to ARM. But since it will need to trace its external jumps during GC, that might expose a patching solution that doesn't need to search for and through constant pools.

Jacob Bramley [:jbramley]

Reporter

Comment 10

•

13 years ago

Attached patch Version 2. (obsolete) — Details — Splinter Review

Attachment #549780 - Attachment is obsolete: true

Attachment #551055 - Flags: review?

Jacob Bramley [:jbramley]

Reporter

Comment 11

•

13 years ago

Attached patch Version 2.1 (obsolete) — Details — Splinter Review

(Sorry, that previous one was a mistaken form submission before I'd qrefreshed.)

This one's ready, I think. I didn't have time to benchmark it but I'll get some results on Monday. It passes all the jit-tests with '-m'.

There are still some things that I'd like to do, but I'd rather get this pushed first:

   * Immediate loads that _aren't_ branches are optimized when patched, but only if they're patched after the code is finalized. Fixing this looked trivial but quickly got messy, so I've postponed it. Most patchable loads are for branches anyway.

   * I'd like to move as much of the code as I can out of the header file and into the CPP file. This caused complications when I first tried it, though it shouldn't really be terribly difficult. However, I'd like to do the same to much of the header files. If nothing else, any one-line modification triggers a huge build process that should really be unnecessary, and inlining doesn't help an awful lot anyway.

Attachment #551055 - Attachment is obsolete: true

Attachment #551060 - Flags: review?(mrosenberg)

Attachment #551055 - Flags: review?

Jacob Bramley [:jbramley]

Reporter

Comment 12

•

13 years ago

Attached patch Version 2.2. (It really is the right one this time.) — Details — Splinter Review

Attachment #551060 - Attachment is obsolete: true

Attachment #551062 - Flags: review?(mrosenberg)

Attachment #551060 - Flags: review?(mrosenberg)

Marty Rosenberg [:mjrosenb]

Comment 13

•

13 years ago

Comment on attachment 551062 [details] [diff] [review]
Version 2.2. (It really is the right one this time.)

Review of attachment 551062 [details] [diff] [review]:
-----------------------------------------------------------------

My only nit with this is in the comments.  Could you specify the variant of BLX that is being referred to?	
e.g.
>        // *insn. If *insn is a BLX instruction preceeded by an LDR, that LDR

From context, it is obvious which one we're talking about, but it would be nicer if it was explicit.

And in a broader view, You should add in a comment describing why having the unused slots refer back to the instruction that they originated from is necessary.  It took me a couple of minutes to figure out why we couldn't just find *any* unused slot, and grab it.
I have some bigger issues with the entire scheme for linking the unused slots back to their instructions, but that can be fixed later.  I'd prefer to get the patch in sooner, and make it algorithmically better later.

Attachment #551062 - Flags: review?(mrosenberg) → review+

Jacob Bramley [:jbramley]

Reporter

Comment 14

•

13 years ago

(In reply to Marty Rosenberg [:Marty] from comment #13)
> Comment on attachment 551062 [details] [diff] [review] [diff] [details] [review]
> Version 2.2. (It really is the right one this time.)
> 
> Review of attachment 551062 [details] [diff] [review] [diff] [details] [review]:
> -----------------------------------------------------------------
> 
> My only nit with this is in the comments.  Could you specify the variant of
> BLX that is being referred to?	
> e.g.
> >        // *insn. If *insn is a BLX instruction preceeded by an LDR, that LDR
> 
> From context, it is obvious which one we're talking about, but it would be
> nicer if it was explicit.

I'll update the comment.

> And in a broader view, You should add in a comment describing why having the
> unused slots refer back to the instruction that they originated from is
> necessary.  It took me a couple of minutes to figure out why we couldn't
> just find *any* unused slot, and grab it.

Actually, one of the things I'd like to do to improve it is to patch loads to use the last empty slot. Because pools are always positioned after the loads, this still guarantees that we have available slots for every B-to-LDR patch. We could then store a bit array at the start of each pool to avoid the expensive linked-list traversal. It doesn't solve the problem of finding the literal pool in the first place, though. That will require some pointer to be stored outside the code. However, as dvander pointed out, we can probably get all this for free in IonMonkey so it might not be worth worrying about too much in the context of JaegerMonkey.

> I have some bigger issues with the entire scheme for linking the unused
> slots back to their instructions, but that can be fixed later.  I'd prefer
> to get the patch in sooner, and make it algorithmically better later.

Agreed.

Jacob Bramley [:jbramley]

Reporter

Comment 15

•

13 years ago

(In reply to David Anderson [:dvander] from comment #9)
> I have no idea whether any of this could apply to ARM. But since it will
> need to trace its external jumps during GC, that might expose a patching
> solution that doesn't need to search for and through constant pools.

Yep, it looks like this system could be used for much of the patching on ARM.

We still have some other patchable things in JM, such as shape guards and the like, where we need to patch a 32-bit value that isn't necessarily a branch. Presumably, we could use the same mechanism for these.

Also, it will be interesting to see how literal pools will be handled in IM (if this has been considered yet). They will need to be attached to the GC-able chunks so depending on the size of the chunks, some attention might need to be paid to how we handle them. Depending on the constants we most often need to handle, it might be sensible to do away with literal pools altogether. In the worst case on ARMv6 (and below), it takes 4 instructions to load a 32-bit value, though most real constants are probably much simpler and on ARMv7 we can load any value in 2 instructions (MOVW+MOVT).

Jacob Bramley [:jbramley]

Reporter

Updated

•

13 years ago

Whiteboard: [inbound]

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 16

•

13 years ago

http://hg.mozilla.org/integration/mozilla-inbound/rev/e1aae4baeaa4

Matt Brubeck (:mbrubeck)

Comment 17

•

13 years ago

Backed out because of test failures on Android:
https://hg.mozilla.org/integration/mozilla-inbound/rev/29e59859d415

Whiteboard: [inbound]

Jacob Bramley [:jbramley]

Reporter

Updated

•

13 years ago

Depends on: 702278

Till Schneidereit [:till]

Comment 18

•

11 years ago

Any relevancy for IM, or INVALID?

Flags: needinfo?(mrosenberg)

Marty Rosenberg [:mjrosenb]

Comment 19

•

11 years ago

The 'JM' tag on this is a bit of a misnomer.  It is technically an improvement to the JSC/arm assembler, which is only used by the JM compiler... *and* YARR. IIRC, when I was looking at the code that YARR created, there were a metric ton of branches (usually load, compare branch, repeat), so this patch has the ability to improve perf on regexp-based tests (and nothing else).  It should be pretty easy to determine if it helps the one or two regexp tests we care about, and discard the patch if it doesn't actually help anything.

Flags: needinfo?(mrosenberg)

Till Schneidereit [:till]

Comment 20

•

11 years ago

Ok, renaming the bug, then.

@jbramley, I unassigned you, but feel free to revert that if you are, in fact, going to work on it. :)

Assignee: Jacob.Bramley → general

OS: Linux → All

Summary: JM: Get the ARM branch patcher to handle 'B'-style branches. → YARR: Get the ARM branch patcher to handle 'B'-style branches.

Till Schneidereit [:till]

Comment 21

•

11 years ago

hrmpf

Status: ASSIGNED → NEW

Chris Peterson [:cpeterson]

Comment 22

•

10 years ago

Marty: is this ARM bug still relevant now that YARR has been with irregexp in Firefox 32 (bug 976446)?

Flags: needinfo?(mrosenberg)

Keywords: perf

Marty Rosenberg [:mjrosenb]

Comment 23

•

10 years ago

It shouldn't be.

Flags: needinfo?(mrosenberg)

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 24

•

10 years ago

(In reply to Marty Rosenberg [:mjrosenb] from comment #23)
> It shouldn't be.

WONTFIX as per comment 23.

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → WONTFIX

Use B branches in the patcher. 13 years ago Jacob Bramley [:jbramley] 28.39 KB, patch		Details \| Diff \| Splinter Review
Optimize literal pool loads after code finalization. 13 years ago Jacob Bramley [:jbramley] 42.41 KB, patch		Details \| Diff \| Splinter Review
Version 2. 13 years ago Jacob Bramley [:jbramley] 42.41 KB, patch		Details \| Diff \| Splinter Review
Version 2.1 13 years ago Jacob Bramley [:jbramley] 42.41 KB, patch		Details \| Diff \| Splinter Review
Version 2.2. (It really is the right one this time.) 13 years ago Jacob Bramley [:jbramley] 51.84 KB, patch	mjrosenb : review+	Details \| Diff \| Splinter Review