Closed Bug 486639 Opened 16 years ago Closed 15 years ago

TraceMonkey: ARM/Thumb Interworking branches are not generated by nanojit.

Categories

(Core :: JavaScript Engine, defect)

ARM
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jbramley, Assigned: jbramley)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 6 obsolete files)

Some general background for those reading this bug who are not familiar with ARM: * ARM processors (generally) support two processing mode: ARM and Thumb. The differences are not important, but in order to switch between the two it is necessary to use a specific type of branch. Using this instruction sequence is necessary for interworking ARM and Thumb code, but some extra fudging is required as these instructions aren't available on some old ARM cores. The reason I haven't just made a patch to address this is that getting the Trace Monkey VM to build for Thumb is rather difficult. There's no configure option and I was only able to get anywhere by hacking the generated makefiles. My point is thus: * _If_ we ever expect to compile the VM itself for Thumb (or Thumb-2), we _need_ to address this issue so that branches from nanojit-generated ARM code can interwork properly. * If we don't ever expect to compile the VM for Thumb, this isn't an issue at all, provided that the generated code guarantees never to call anything outside of TM. For example, if generated code calls into the C library and the C library is compiled for Thumb-2, interworking calls will be required. Building the VM for Thumb-2 may have useful benefits over ARM. Thumb-2 provides a tremendous code size advantage over ARM for roughly the same performance, but is only available on ARMv6T2 and ARMv7+ cores. (That includes the Cortex-A8 but not ARM1176.) The problem here is that we would also need to generate an ARM binary distribution for those using ARM1176, so falling back to ARM as the default but allowing a Thumb-2 build configuration for people building from source may be a good option. It seems that we have two obvious options: * Add something to the configure script to allow building TM itself for Thumb(-2) _and_ implement interworking branches. - This is the best option as far as testing is concerned as it will be trivial to detect regressions and the like by simply running the test suite with the new build configuration. - This option entails the most work. * Do nothing for now. It's possible that building for Thumb was deliberately removed at some point, and we will never need interworking support in the generated code. We may need to address this in the future as I think building the VM for Thumb-2 could be beneficial. Any feedback will be appreciated. Thanks! Jacob
In previous discussions with Adobe we decided that Thumb1 would not be a good target for a JIT. The code cache is supposed to contain code that is inherently hot, so trading off compactness for performance for hot code seems silly. Hence we removed Thumb1. We also discussed that Thumb2 most likely would actually be an improvement, so we want a Thumb2 target. I am not sure how to do the configuration. We seem to be drifting towards runtime configuration (vlad just added VFP detection). If the backend is small enough we might tolerate having a Thumb2 code generation mode in parallel to regular ARM. Any comments? vlad? ed?
I'm talking about the VM itself, not the JIT. I think that a Thumb2 _JIT_ would certainly be beneficial, but I'm talking about compiling TM itself for Thumb2 (but generating ARM code, as it currently does).
the tamarin nanojit arm backend contains interworking support so the C code can be compiled for arm or thumb. but, NativeARM.cpp/h may have diverged too much in the two branches (tamarin / tracemonkey) I agree, Thumb2 would be great. It should be a separate backend from NativeARM.cpp, however (clean slate).
Yep, I'm working from one of Andrew's patches for Tamarin. Whilst it won't apply cleanly to TM, I know how the changes work and it wouldn't take me long to implement equivalent changes; the build system modifications might take me longer. The question is whether or not it matters; if TM is only intended to be compiled as ARM (regardless of the JIT output), there is probably no need to generate interworking code. --- A clean slate would be nice for a Thumb2 JIT, but I'd like to move that discussion out of this bug as it's confusing enough already :-) (I'm on #jsapi as "Jacob" if you want to discuss it.)
> * _If_ we ever expect to compile the VM itself for Thumb (or Thumb-2), we > _need_ to address this issue so that branches from nanojit-generated ARM code > can interwork properly. I don't think we'll do this for spidermonkey. > * If we don't ever expect to compile the VM for Thumb, this isn't an issue at > all, provided that the generated code guarantees never to call anything outside > of TM. For example, if generated code calls into the C library and the C > library is compiled for Thumb-2, interworking calls will be required. The second part is more interesting -- we'll be mostly calling mozilla code, but I can see a future world where we'd want to make more and more direct calls on trace. I don't think there's a big cost to generating iterworking branches, is there? Though... I haven't thought this through all that much, but intra-trace branches would just be normal, and we already restore pc on return via ldmia which will handle the mode switch (on new enough architectures). I guess for ins_call we don't generate bx but just b? Is there any other change that would be needed?
There is a slight complexity cost if we want to support ARMv4 and ARMv4T cores. * ARMv4 (~ARM7) didn't have Thumb support, and doesn't understand the BX instruction. * ARMv4T (~ARM7TDMI) understands BX, but not BLX, so interworking has an associated cost on ARMv4T. * ARMv5 and onwards are much easier to deal with. They support BLX, which does everything in one instruction, but to honest I'm not sure if interworking impacts performance at all. I would guess that it doesn't, but I have no figure to back this up. More details are available on a handy table on this page: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204i/Cihfddaf.html --- ARMv4(T) ARMv4 is pretty unusual, and I doubt that anyone would want to run TM on it. However, I have no figure to back this up so I can't say for sure. I think the best-known ARMv4 chip is the StrongARM. ARMv4T includes most ARM7 variants (such as the ARM7TDMI) and a few ARM9 variants. The GP2X, for example, has an ARMv4T ARM9 in it (according to Wikipedia). --- Code The code itself is easy enough to write, and I already have an old Tamarin patch to work from so it won't take me long to implement the interwork code itself. I won't be able to generate simple or automatable tests, though. Intra-trace branches are no problem at all because the target is known to be ARM code. --- Thumb2 for Trace Monkey Itself I do think that compiling TM for Thumb-2 would be beneficial. Even if we don't do this now, it could be useful in the future. T2 gives you almost the performance of ARM with the code size of T1, so even though it's not supported on ARM1176, it could be worth doing. In any case, if we want to support interworking branches anyway we will automatically support interworking into a potential future T2 TM.
Assignee: general → Jacob.Bramley
Attached patch Generate interworking branches. (obsolete) — Splinter Review
--- Patch Summary * asm_call uses the new BLX method to generate interworking branches. - BLX() will generate interworking branches on ARMv4T, which doesn't support the actual BLX instruction. - BLX() is clever enough to generate simple (non-interworking) branches for ARMv4(T) when the target is not Thumb. This is always the case for plain ARMv4. * The processor feature detection has been modified to allow detection of ARMv5 support (along with other features). - TODO: I can't test the WinCE code here. It looks Ok, but could someone try this out please? * The patch works only if asm_call is the only function used to call out of generated code. That appears to be the case as it handles LIR_call, which is generated by insCall, but I'm not entirely familiar with the code so there could be some other way that calls can be generated. * Existing branch mechanisms for branching within (and directly between) generated code fragments are unchanged, as the target will always be ARM code and interworking is not required. * The exit branches were generated as "BX"; this won't work for ARMv4, so I've added a fix for this too. * A number of asm_output() calls have been tidied up to be more informative of what's actually generated. --- Bug Reproduction Whilst I could hack the generated makefiles to get TM itself in Thumb, the generated code always seems to be far enough away from the native code that a simple BLX is never generated. The original code used "LDR PC, =addr" in this case, and on ARMv5+ this type of branch supports interworking so the behaviour is correct. (I don't have an ARMv4T platform to test on.) However, this is still a potential bug as we can't guarantee the memory locations of each code element. To clarify: * The bug will manifest on ARMv5+ if the target address is within 32MB of the branch instruction _and_ the target is Thumb (or Thumb-2) code. * The bug will manifest on ARMv4T regardless, as LDR PC does not interwork on this architecture. * A slightly different bug will manifest on ARMv4, but that should be fixed by my patch. --- Old Architectures There's a lot of code in there to handle ARMv4(T). This does make me wonder if it's worth supporting these. What is the policy on this? Can we assume ARMv5+, or do we need the ARMv4 support code? If we don't need to support ARMv4, I'll strip that stuff out of the patch to make it simpler and cleaner. Note: I think Tamarin supported ARMv4T, but I don't know about ARMv4. --- Testing No additional failures arise from adding this patch on either trace-tests.js or js/tests.
Status: NEW → ASSIGNED
Attachment #371456 - Flags: review?(vladimir)
Attachment #371456 - Attachment is obsolete: true
Attachment #371456 - Flags: review?(vladimir)
Comment on attachment 371456 [details] [diff] [review] Generate interworking branches. Most of the changes in the patch actually relate to platform detection, so I'll split that out as a separate patch.
Attached patch Simplified patch. (obsolete) — Splinter Review
Bug 487416 now covers the feature detection, so this new patch is cleaner than the previous one.
Depends on: 487416
Attachment #371687 - Flags: review?(vladimir)
Comment on attachment 371687 [details] [diff] [review] Simplified patch. Looks great!
Attachment #371687 - Flags: review?(vladimir) → review+
Attached patch A couple of minor corrections. (obsolete) — Splinter Review
Oops... My previous patch had an erroneous underrunProtect and some broken comments. The new one fixes this. It probably doesn't need another review but I've tagged it anyway to be on the safe side.
Attachment #371687 - Attachment is obsolete: true
Attachment #371815 - Flags: review?(vladimir)
Blocks: 487607
Comment on attachment 371815 [details] [diff] [review] A couple of minor corrections. Bah, I even looked at that underrunProtect and convinced myself that it was right somehow :p
Attachment #371815 - Flags: review?(vladimir) → review+
Keywords: checkin-needed
Attached patch Remove ARM4(T) support. (obsolete) — Splinter Review
We have previously decided that we don't need to support ARMv4(T). My previous patch included a significant amount of code for handling ARMv4(T), so I would like to remove this. The attached patch will work on ARMv5+.
Attachment #371815 - Attachment is obsolete: true
Attachment #376694 - Flags: review?(vladimir)
Attachment #376694 - Attachment is obsolete: true
Attachment #376695 - Flags: review?(vladimir)
Attachment #376694 - Flags: review?(vladimir)
vlad, could we get a review on the final patch here?
Comment on attachment 376695 [details] [diff] [review] Removes ARMv4(T) support. (Fixes a comment.) Oh, perfect -- I actually forgot about this patch. The previous BL impl for a long offset was horrible for perf; note that nothing uses BL() any more, so can you just nuke it? I'm not sure if the BLX impl with a 24-bit offset is correct though; BLX with an immediate offset will always exchange instruction modes, as best I can tell. I think we need to check whether we're branching to Thumb or ARM code, and emit BLX or BL as appropriate, right? Also, for the case where we need to do a LD32, make the underrunProtect(4+LD32_size) instead of hardcoding 12. Here's the code that I had: intptr_t offs = PC_OFFSET_FROM((addr&~3),_nIns-1); if (isS24(offs>>2)) { // we already did an underrunProtect(4) above if (AvmCore::config.thumb && ((int32_t)addr) & 0x1 == 1) { // we need to branch to thumb, so emit BLX here int32_t h = (((int32_t) addr) & 0x2) >> 1; // BLX addr (via offs & H) *(--_nIns) = (NIns)( (0xFA | h)<<24 | ((offs>>2) & 0xFFFFFF) ); asm_output("blx %p", addr); } else { // just a normal BL // BL addr (via offs) *(--_nIns) = (NIns)( COND_AL | (0xB<<24) | ((offs>>2) & 0xFFFFFF) ); asm_output("bl %p", addr); } } else { if (AvmCore::config.thumb) { underrunProtect(4+LD32_size); // BLX IP *(--_nIns) = (NIns)( COND_AL | (0x12FFF3<<4) | IP ); asm_output("blx ip"); } else { underrunProtect(8+LD32_size); MOV(PC, IP); MOV(LR, PC); } LD32_nochk(IP, (int32_t)addr); }
Yep, BL can be removed, as in bug 487607. --- Argh! Yes, you're right! Not sure how I managed that. I'll fix the patch. --- Using 4+LD32_size is good; I didn't know about LD32_size when I wrote the patch.
Keywords: checkin-needed
Attached patch Fixes BLX for ARM targets. (obsolete) — Splinter Review
Mostly the same but adjusts some comments and emits BL if the target is ARM.
Attachment #376695 - Attachment is obsolete: true
Attachment #378068 - Flags: review?(vladimir)
Attachment #376695 - Flags: review?(vladimir)
Comment on attachment 378068 [details] [diff] [review] Fixes BLX for ARM targets. >+ } >+ } else { >+ // BLX IP >+ *(--_nIns) = (NIns)( (COND_AL) | (0x12<<20) | (0x3<<4) | (IP) ); >+ asm_output("blx IP (=%p)", addr); I missed this earlier -- this needs to be 0x12<<20 | 0xFFF << 8 | 0x3<<4 (or just 0x12FFF3 << 4) -- the 12 bits there are SBO, not SBZ, right? Also, nit -- can you make that IP lowercase in the output? (The other register names are all lowercase in the asm output)
Yep, you're right again! I read "SBO" as "should be zero", even though it clearly doesn't mean that! Interestingly, the code still worked before; it seems that current processors don't really care what those bits are set to, though this may break if they are used in the future.
Attachment #378068 - Attachment is obsolete: true
Attachment #378301 - Flags: review?(vladimir)
Comment on attachment 378301 [details] [diff] [review] Fixes BLX encoding and uses lower case register names in debug output. Looks good, thanks for making the changes -- sorry I didn't catch the SBO issue last review time!
Attachment #378301 - Flags: review?(vladimir) → review+
Keywords: checkin-needed
Vlad, will this work on WinCE? I hear that the WinCE emulator only supports ARMv4 with one or two ARMv5 instructions, but I don't know if that includes BLX. If not, we'll have to wrap it in an #ifdef or something ugly like that, and emit BL for WinCE (since it doesn't use Thumb anyway).
The emulator's BLX implementation is actually broken or somesuch. However, for us at least, I don't think we care -- it's basically impossible to run Fennec/Firefox/whatever in Microsoft's emulator; it's just too slow.
Keywords: checkin-needed
Whiteboard: fixed-in-tracemonkey
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: