I just disabled fixUpOffsets' ability to promote "LDR pc, =addr" to "B addr" because the branch patcher doesn't know what to do with a branch instruction: http://hg.mozilla.org/projects/jaegermonkey/rev/13495e9f957e

This ability should be reinstated, and we should also make it more aggressive, because in JM we have a lot of short-range branches between fast and slow paths. (In JSC, this probably wasn't the case.) I would have done that already, but it's non-trivial: once a branch is made into a 'B', it is hard to find its literal pool entry again in case we need to promote it back to an LDR.
Hmm. The trivial scheme of encoding an offset to a pool entry in a NOP after every branch seems to cost about 5% of performance on Kraken, with "-m". I didn't measure any difference on V8, however. In any case, a more sophisticated scheme of logging the pool entries may be more successful.
Created attachment 546151 [details] [diff] [review]
Use B branches in the patcher.

The attached patch is very crude and definitely not complete, but I have some preliminary results so I'm attaching it as a work-in-progress. Here's how it works:

When an LDR is optimized to be a B, its literal pool slot becomes an element in a linked list. The linked list is restricted to the bounds of the literal pool. In addition, the initial element of every literal pool is the first element in one of these lists, and is an invalid instruction encoding so we can find the pools. In addition to being part of a linked list, each unused pool slot also contains an offset to the LDR or B to which it originally belonged. The linked-list offsets and LDR offsets are all packed into the 32-bit pool slot: 11 bits for each linked-list offset and 10 bits for the LDR offset. The LDR offset doesn't need a sign bit as it's always negative, and 10 bits (or 11 bits signed) is always enough; an LDR has a 12-bit offset, but ours are always 4-byte aligned.

Here's how LDR branches are optimized to relative B branches:
 * Calculate the required branch offset (since B is a relative branch) and write the B instruction.
 * Synchronize the I and D caches, since we modified the instruction stream.
 * Scan forward from the LDR/B to the next literal pool. Literal pools in JM are always _after_ the associated load, so this will never fail. Also, because the first element in the pool is known not to be a valid instruction, we'll never get false positives.
 * The first element in the pool also forms an element in the linked list of pool slots, so we can use this to insert the newly-freed pool slot into the list.

Here's how B branches are de-optimized to long-range LDR branches:
 * Scan forward to find the literal pool, as before.
 * Walk through the linked list of unused slots until a matching slot is found.
 * Remove the slot from the list and put the new branch target into it.
 * Patch the B instruction to the relevant LDR instruction.
 * Synchronize the I and D caches, since we modified the instruction stream.

The patching process is quite slow for patches that swap between B and LDR, but reasonably fast for B-to-B and LDR-to-LDR patches. B-to-B patching requires a cache operation, but LDR-to-LDR behaves exactly as it does now. With '-m' alone, I measured a 4.4% performance improvement on Kraken. I think the overhead of scanning for the pool is about 1.5% overall. However, I think there's more potential here, and I'd like to try some further improvements:
 * I currently only optimize branches when they are patched. Branches that never get patched (after code finalization) don't benefit from this.
 * I accidentally tested the patch with those ((noinline)) attributes in place. They're there to aid debugging, so I might get marginally better performance without them.
 * It should be possible to cache the literal pool base at least, removing the need to scan for the literal pool on each patch. This is not quite as simple as it sounds, however, as checking the validity of the cache is tricky. (I have a couple of ideas.)
 * We also store patchable literals in the literal pools. It would be trivial to make some of these patchable using the same method as the branch patching. (ARM can do some limited immediate loads in a single instruction.)
 * If we know that a branch will be short-range when we generate it, we could have a method in the back-end to ignore the literal pool for these branches altogether. This would save space in the literal pools and avoid the code-scanning overheads.
 * If we can give a hint to the back-end that a branch will never need to be patched again, we can skip all the scanning and messing about, since we will never need to find the literal pool slot again. I don't know how often this would be useful in practice.
 * When patching a large batch of branches, it would probably be wise to queue up several cache synchronization calls and issue them all at once. Each synchronization requires a call into the kernel, so they aren't especially quick.

I did look at patching our function call sequences (which do "LDR r8, =addr; BLX r8"), but the calls never seem to be within range of a simple BL or BLX. Our function calls are usually to C code, which gets allocated from a different address base. This might change with bug 615487.

Whilst it might be more time-efficient to store pointers to the literal pool slots in the IC structures (for example), this implementation is more space-efficient, and because the painful patches are relatively rare I thought the space-efficiency would be more important. Also, this solution is entirely contained within Nitro and thus will benefit IonMonkey as well as JaegerMonkey, with little (if any) extra effort.
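For illustration, the 11+11+10-bit slot packing described above could look like the following sketch. The field order, the helper names, and the interpretation of the two 11-bit fields as next/prev linked-list offsets are my assumptions, not necessarily what the attached patch does; offsets are counted in 4-byte words.

```cpp
#include <cassert>
#include <cstdint>

// Pack a freed literal pool slot: two 11-bit linked-list offsets and a
// 10-bit (always-backwards, so unsigned) offset to the original LDR/B.
// 11 + 11 + 10 = 32 bits, filling the whole slot.
static inline uint32_t packFreeSlot(uint32_t next, uint32_t prev, uint32_t ldrBack) {
    assert(next < (1u << 11) && prev < (1u << 11) && ldrBack < (1u << 10));
    return (next << 21) | (prev << 10) | ldrBack;
}

// Unpack the individual fields again.
static inline uint32_t slotNext(uint32_t bits)    { return bits >> 21; }
static inline uint32_t slotPrev(uint32_t bits)    { return (bits >> 10) & 0x7ff; }
static inline uint32_t slotLdrBack(uint32_t bits) { return bits & 0x3ff; }
```

Because the LDR offset is word-granular, 10 bits comfortably cover the LDR's byte-granular 12-bit addressing range, as noted above.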
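The batching idea in the last bullet could be sketched as below: collect the patched ranges, merge adjacent or overlapping ones, and issue one synchronization per merged range. FlushBatch is a hypothetical name, and the real cache-sync primitive (a kernel call on ARM) is passed in as a callback rather than modeled here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Collects [start, end) byte ranges of patched code, then issues the
// minimum number of flush calls by merging overlapping/adjacent ranges.
struct FlushBatch {
    std::vector<std::pair<uintptr_t, uintptr_t>> ranges;

    void add(uintptr_t start, size_t len) {
        ranges.push_back({start, start + len});
    }

    // flushRange(start, len) stands in for the real cache-sync call.
    // Returns the number of flush calls actually issued.
    template <typename FlushFn>
    size_t commit(FlushFn flushRange) {
        std::sort(ranges.begin(), ranges.end());
        size_t calls = 0;
        size_t i = 0;
        while (i < ranges.size()) {
            uintptr_t start = ranges[i].first, end = ranges[i].second;
            // Absorb every later range that touches the current one.
            while (++i < ranges.size() && ranges[i].first <= end)
                end = std::max(end, ranges[i].second);
            flushRange(start, end - start);
            ++calls;
        }
        ranges.clear();
        return calls;
    }
};
```

Whether this wins anything depends on how expensive each kernel call is relative to flushing the occasional cold gap between nearby patches.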
I've got this up to a 6.5% performance boost on Kraken! I'll post a patch and more details tomorrow.

I've done the following of my suggested improvements:
[X] Include branches that are marked unpatchable.
[X] Turn off ((noinline)).
[X] Cache the last literal pool search results.
[X] Also optimize non-branch loads.

I've not done the following:
[ ] Allow method-JIT code to hint that a branch will never need repatching.
[ ] Queue up cache synchronizations and issue them in batches.

The remaining tasks are promising, but fiddly. They might be better added as separate patches.
> Whilst it might be more time-efficient to store pointers to the literal
> pool slots in the IC structures (for example), this implementation is
> more space-efficient, and because the painful patches are relatively
> rare I thought the space-efficiency would be more important. Also, this
> solution is entirely contained within Nitro and thus will benefit
> IonMonkey as well as JaegerMonkey, with little (if any) extra effort.

Do you mean that the IC structures will be less space efficient? I think we can get away with not increasing the size of the IC structure. Rather than having one pointer that points at the instruction and another that points at the slot in the pool, we will always be able to get one from the other. If the instruction is currently an LDR, store a pointer to the instruction in the IC; we can recover the slot from the offset stored in the instruction. If the instruction is a branch, then we want to store a pointer to the slot in the pool (with 0x1 or'ed in); we can recover the address of the branch by having a pointer to it in the slot (or enough bits to recover it).

Also, we may want to implement a fix for bug 675342 in the patching code. Since the "branch" instruction doesn't actually need to be a branch, we can simply output a NOP rather than a branch to the next instruction.
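The 0x1-tagging scheme suggested above could be sketched like this, assuming 4-byte-aligned instructions and pool slots (so bit 0 is always free to act as a tag). All names here are illustrative, not actual IC fields.

```cpp
#include <cassert>
#include <cstdint>

// One IC word serves for both states: it either points at the LDR
// instruction (tag clear) or at the literal pool slot (tag 0x1 set).

static inline uintptr_t encodeInsn(uintptr_t insnAddr) {
    assert((insnAddr & 0x3) == 0);  // instructions are 4-byte aligned
    return insnAddr;                // LDR case: plain pointer
}

static inline uintptr_t encodeSlot(uintptr_t slotAddr) {
    assert((slotAddr & 0x3) == 0);  // pool slots are 4-byte aligned
    return slotAddr | 0x1;          // B case: pool-slot pointer, tagged
}

static inline bool isSlot(uintptr_t word)     { return (word & 0x1) != 0; }
static inline uintptr_t untag(uintptr_t word) { return word & ~uintptr_t(0x1); }
```

The patcher would check isSlot() to decide which de-reference direction it needs: instruction-to-slot via the LDR's offset field, or slot-to-branch via whatever back-reference the slot carries.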
(In reply to comment #4)
> > Whilst it might be more time-efficient to store pointers to the literal
> > pool slots in the IC structures (for example), this implementation is
> > more space-efficient, and because the painful patches are relatively
> > rare I thought the space-efficiency would be more important. Also, this
> > solution is entirely contained within Nitro and thus will benefit
> > IonMonkey as well as JaegerMonkey, with little (if any) extra effort.
>
> Do you mean that the IC structures will be less space efficient?
> I think we can get away with not increasing the size of the IC structure.
> [...]

Yep, that could work. My current patch doesn't store anything in the IC structure at all, but does everything within the back-end. (This makes it applicable to IonMonkey for free.) It could easily be adapted, however.

> Also, we may want to implement a fix for bug 675342 in the patching code.
> Since the "branch" instruction doesn't actually need to be a branch, we can
> simply output a nop rather than a branch to the next instruction.

Yep, that would be trivial to add to my patch, so I'll give it a shot now. I didn't realize there were so many pointless branches!
(In reply to comment #4)
> Rather than have one pointer that points at the instruction, and one pointer
> that points at the slot in the pool. We will always be able to get one from
> the other.

Bear in mind that many pointers in ICs are actually offsets from a base pointer. These offsets are usually single bytes, and are hard-coded in many cases to save memory. Re-using the branch label offset to point to the literal pool is possible, but maybe not as easy as it sounds, and probably not free of memory overheads. Also, this would need to be done to every IC individually because they don't share a lot of code. It would be a big patch, _but_ it might win another 1.5% or so of performance, so it's worth doing. I'd propose investigating it as a separate bug, however; this one covers enough already.
Latest news:

TEST                       COMPARISON
===========================================
** TOTAL **:               1.186x as fast
===========================================
ai:
  astar:                   -
audio:                     1.186x as fast
  beat-detection:          1.198x as fast
  dft:                     1.149x as fast
  fft:                     1.25x as fast
  oscillator:              1.162x as fast
imaging:                   1.25x as fast
  gaussian-blur:           1.27x as fast
  darkroom:                1.32x as fast
  desaturate:              1.162x as fast
json:                      ??
  parse-financial:         -
  stringify-tinderbox:     ??
stanford:                  1.052x as fast
  crypto-aes:              -
  crypto-ccm:              -
  crypto-pbkdf2:           1.077x as fast
  crypto-sha256-iterative: -

Some tests are much faster, others are about the same. I ran with "-m" in both cases, to factor out trace-JIT and profiling behaviour.

Patch coming shortly...
Created attachment 549780 [details] [diff] [review]
Optimize literal pool loads after code finalization.

This version covers the features I mentioned previously, and also the redundant branches discussed in bug 675342. The performance gain is surprising: I'd seen 6-8% consistently, but I saw much more than that after incorporating Marty's suggestion from bug 675342 (as in my last comment). The patch is still full of debug code and needs cleaning up, which is why I haven't asked for a review.
Marty asked how this will port over to IonMonkey (I think I've got the right bug), and the problem sounds strikingly familiar to what happened on x64. For JM, we ran into a problem where the 64-bit address range could overflow the 32-bit offset in jumps. As a quick fix, we inserted checks everywhere to disable ICs when this happened.

The story in IonMonkey (which does not yet have ARM support) is a little different. Code is GC'd, so we need to be able to trace its jumps to ICs and other scripts. We achieve this by appending a compressed relocation table to each code object. On x86, the relocation table is just a list of offsets to all external jump instructions. To trace JIT code, we sniff the target of each jump. On x64, for each external jump, we reserve 16 bytes in a pool at the end of the code. If we ever need to patch the jump, and the target does not fit in a 32-bit offset, we redirect the jump into that pool. The relocation table has the positions of both the small and large jumps, and the trace hook figures out which one is active.

Tentatively, I was planning for ICs to store their jump offsets as indexes into the relocation buffer (these indexes will be small), which would also give the patching mechanism all the information it needs.

I have no idea whether any of this could apply to ARM. But since it will need to trace its external jumps during GC, that might expose a patching solution that doesn't need to search for and through constant pools.
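The x64 near/far decision described above comes down to a signed 32-bit range check on the displacement. A sketch, using fixed 64-bit integers so it behaves the same on any host (fitsRel32 is a made-up name, not an IonMonkey function):

```cpp
#include <cassert>
#include <cstdint>

// A rel32 jump encodes the displacement from the end of the jump
// instruction to the target. If the displacement fits in a signed
// 32 bits, patch the jump directly; otherwise redirect it through
// the 16-byte far-jump slot reserved in the pool at the end of code.
static inline bool fitsRel32(uint64_t jumpEnd, uint64_t target) {
    int64_t delta = int64_t(target) - int64_t(jumpEnd);
    return delta >= INT32_MIN && delta <= INT32_MAX;
}
```

The trace hook's job of "figuring out which one is active" is then the inverse check: a small jump whose target is inside the pool means the far slot holds the real destination.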
Created attachment 551055 [details] [diff] [review] Version 2.
Created attachment 551060 [details] [diff] [review]
Version 2.1

(Sorry, that previous one was a mistaken form submission before I'd qrefreshed.)

This one's ready, I think. I didn't have time to benchmark it, but I'll get some results on Monday. It passes all the jit-tests with '-m'.

There are still some things that I'd like to do, but I'd rather get this pushed first:
 * Immediate loads that _aren't_ branches are optimized when patched, but only if they're patched after the code is finalized. Fixing this looked trivial but quickly got messy, so I've postponed it. Most patchable loads are for branches anyway.
 * I'd like to move as much of the code as I can out of the header file and into the CPP file. This caused complications when I first tried it, though it shouldn't really be terribly difficult. I'd like to do the same to much of the header files: if nothing else, any one-line modification triggers a huge build process that should really be unnecessary, and inlining doesn't help an awful lot anyway.
Created attachment 551062 [details] [diff] [review] Version 2.2. (It really is the right one this time.)
Comment on attachment 551062 [details] [diff] [review]
Version 2.2. (It really is the right one this time.)

Review of attachment 551062 [details] [diff] [review]:
-----------------------------------------------------------------

My only nit with this is in the comments. Could you specify the variant of BLX that is being referred to? e.g.

> // *insn. If *insn is a BLX instruction preceeded by an LDR, that LDR

From context, it is obvious which one we're talking about, but it would be nicer if it was explicit.

In a broader view, you should add a comment describing why having the unused slots refer back to the instruction that they originated from is necessary. It took me a couple of minutes to figure out why we couldn't just find *any* unused slot and grab it.

I have some bigger issues with the entire scheme for linking the unused slots back to their instructions, but that can be fixed later. I'd prefer to get the patch in sooner and make it algorithmically better later.
(In reply to Marty Rosenberg [:Marty] from comment #13)
> Comment on attachment 551062 [details] [diff] [review]
> Version 2.2. (It really is the right one this time.)
>
> Review of attachment 551062 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> My only nit with this is in the comments. Could you specify the variant of
> BLX that is being referred to?
> e.g.
> > // *insn. If *insn is a BLX instruction preceeded by an LDR, that LDR
>
> From context, it is obvious which one we're talking about, but it would be
> nicer if it was explicit.

I'll update the comment.

> And in a broader view, You should add in a comment describing why having the
> unused slots refer back to the instruction that they originated from is
> necessary. It took me a couple of minutes to figure out why we couldn't
> just find *any* unused slot, and grab it.

Actually, one of the things I'd like to do to improve it is to patch loads to use the last empty slot. Because pools are always positioned after the loads, this still guarantees that we have available slots for every B-to-LDR patch. We could then store a bit array at the start of each pool to avoid the expensive linked-list traversal.

It doesn't solve the problem of finding the literal pool in the first place, though; that will require some pointer to be stored outside the code. However, as dvander pointed out, we can probably get all this for free in IonMonkey, so it might not be worth worrying about too much in the context of JaegerMonkey.

> I have some bigger issues with the entire scheme for linking the unused
> slots back to their instructions, but that can be fixed later. I'd prefer
> to get the patch in sooner, and make it algorithmically better later.

Agreed.
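The bit-array alternative floated above could look something like this: one bit per pool slot, with claims always taking the highest free slot so earlier slots remain available for the loads that precede the pool. This is a sketch with invented names, not code from the patch, and it relies on the GCC/Clang __builtin_clz intrinsic:

```cpp
#include <cassert>
#include <cstdint>

// Claim the highest-indexed free slot from a 32-slot bitmap (bit set =
// slot free). Returns the slot index, or -1 if no slot is free.
static inline int claimLastFreeSlot(uint32_t* freeBits) {
    if (*freeBits == 0)
        return -1;                         // pool exhausted
    int idx = 31 - __builtin_clz(*freeBits);
    *freeBits &= ~(1u << idx);             // mark the slot as used
    return idx;
}
```

A single word per 32 slots replaces the per-slot linked-list walk with one count-leading-zeros instruction, though (as noted above) it still doesn't help with locating the pool itself.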
(In reply to David Anderson [:dvander] from comment #9)
> I have no idea whether any of this could apply to ARM. But since it will
> need to trace its external jumps during GC, that might expose a patching
> solution that doesn't need to search for and through constant pools.

Yep, it looks like this system could be used for much of the patching on ARM. We still have some other patchable things in JM, such as shape guards and the like, where we need to patch a 32-bit value that isn't necessarily a branch. Presumably, we could use the same mechanism for these.

Also, it will be interesting to see how literal pools will be handled in IM (if this has been considered yet). They will need to be attached to the GC-able chunks, so depending on the size of the chunks, some attention might need to be paid to how we handle them. Depending on the constants we most often need to handle, it might be sensible to do away with literal pools altogether. In the worst case on ARMv6 (and below), it takes 4 instructions to load a 32-bit value, though most real constants are probably much simpler, and on ARMv7 we can load any value in 2 instructions (MOVW+MOVT).
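For reference, the MOVW/MOVT pair mentioned above loads a 32-bit constant in two halves: MOVW writes the low 16 bits and zeroes the upper half, then MOVT writes the high 16 bits. A small sketch of the immediate split (helper names are mine; the actual instruction encoding is omitted):

```cpp
#include <cassert>
#include <cstdint>

// The 16-bit immediates carried by the two instructions.
static inline uint16_t movwImm(uint32_t value) { return value & 0xffff; }
static inline uint16_t movtImm(uint32_t value) { return value >> 16; }

// The register value after "MOVW rd, #lo; MOVT rd, #hi".
static inline uint32_t materialize(uint16_t lo, uint16_t hi) {
    return (uint32_t(hi) << 16) | lo;
}
```

Constants with a zero upper half need only the MOVW, which is one reason most real constants come out cheaper than the worst case.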
Backed out because of test failures on Android: https://hg.mozilla.org/integration/mozilla-inbound/rev/29e59859d415
Any relevance for IM, or INVALID?
The 'JM' tag on this is a bit of a misnomer. It is technically an improvement to the JSC/ARM assembler, which is only used by the JM compiler... *and* YARR. IIRC, when I was looking at the code that YARR created, there was a metric ton of branches (usually load, compare, branch, repeat), so this patch has the ability to improve perf on regexp-based tests (and nothing else). It should be pretty easy to determine if it helps the one or two regexp tests we care about, and discard the patch if it doesn't actually help anything.
Ok, renaming the bug, then. @jbramley, I unassigned you, but feel free to revert that if you are, in fact, going to work on it. :)
Marty: is this ARM bug still relevant now that YARR has been replaced with irregexp in Firefox 32 (bug 976446)?
It shouldn't be.
(In reply to Marty Rosenberg [:mjrosenb] from comment #23) > It shouldn't be. WONTFIX as per comment 23.