User-Agent: Mozilla/5.0 (Macintosh; PPC Mac OS X 10.4; rv:2.0b8pre) Gecko/20110107 Firefox/4.0b8pre TenFourFox/Debugging Build Identifier: This adds code for Firefox nanojit support for PowerPC to the tree. This bug does NOT enable PPC nanojit in the build (thus NPOTB); that will be in a future patch. However, it is already enabled in TenFourFox and will be part of beta 9 of that browser. I discussed this work with Edwin Smith in E-mail already -- thanks for all your help, Ed! This patch does the following: - Adds nFragExit to NativePPC.cpp - Adds support for the overflow math instructions to asm_arith() in NativePPC.cpp - Completes nPatchBranch in NativePPC so that it can "demote" 14-bit and 24-bit branch displacements to CTR-based branches, and conversely "promote" CTR-branches to 14-bit displacements where possible, in NativePPC.cpp - Adds additional opcodes to NativePPC.h required by the above - Implements a basic instruction scheduler that hoists independent instructions up higher to facilitate better ILP. This is done by (ab)using the EMIT1 macro in NativePPC.h to do instruction swapping, and adding a new struct to Assembler that tracks instruction history. Special cases are added for certain common sequences that this scheduler does not catch, as well as macros for disabling the optimizer for sequences that must be emitted in strictly serialized order. Performance is mixed. It actually takes a non-trivial hit on SunSpider to use the JIT, although it does significantly better on many parts of Dromaeo. The net effect is overall positive for most tasks, but not nearly enough. Therefore, I'm posting this first working draft to get more eyes on it and suggestions for further optimization. Notes on stuff that came up during development and other related bugs Ed and Nick Nethercote pointed out in our E-mail conversation: - There are lots of stack stores in hot code related to side exits (an empty loop takes an unbelievable amount of time because of this). This hurts badly on POWER which can take a big stall hit for memory access. The suggested solution for this is, as I understand it, controversial (bug 537842). - Some of the load and store noise could be improved by bug 514102, and possibly further with bug 602793. - Bug 545406 "TM: loop invariant code motion (LICM)" will probably benefit PPC greatly. - For certain cases where we can statistically predict the PPC nanojit will take a bath with traced code, bug 597439 where we can let the compiler do our optimization (of the LIR interpreter itself) may also be useful. I have not made any tracer changes in this patch to that end, although I am exploring them. Reproducible: Always Steps to Reproduce: n/a Actual Results: Warp 3. Expected Results: Ludicrous speed.
Created attachment 502270 [details] [diff] [review] PowerPC nanojit for Firefox, working (v1) As above. Ed, are you comfortable reviewing this (there are no TM-specific changes), or should I add Nick too?
We've gone through great efforts to ensure that Assembler.cpp does not contain platform ifdefs (re: #ifdef NANOJIT_PPC). Not having reviewed the entire patch, but knowing that Tamarin has a PPC back-end with no leaky ifdefs, I'm guessing that we'd want this patch appropriately massaged.
I could probably move the history structure out, but I need some way of dealing with LIR_label so that the peephole swaptimizer doesn't inadvertently change instructions around on a branch destination. What would be the best way of doing that? Similarly, where could I hook the startup of the assembler so I can set the swaptimizer's default state?
(In reply to comment #3) > I could probably move the history structure out, but I need some way of dealing > with LIR_label so that the peephole swaptimizer doesn't inadvertently change > instructions around on a branch destination. What would be the best way of > doing that? > Would _labels give you enough information? > Similarly, where could I hook the startup of the assembler so I can set the > swaptimizer's default state? nBeginAssembly() is probably the right place. It's called right before any code is emitted.
nBeginAssembly() looks okay for my purposes. _labels would probably give me the information I need, but it looks like I'd have to check it on every instruction I intend to swap, which would really chug (unless I misunderstand your suggestion). Incidentally, I'm starting to get feedback from TenFourFox users about the nanojit, which is live in beta 9. On G4, it cuts SunSpider time in half, so it appears to be very valuable for 74xx. Since most of the microcontrollers, PPC Amigas and Power Macs out there are G4-derivatives, this implies the nanojit may be a bigger overall win in the general case. G3 testing is still pending, but I can't seem to get it any quicker for G5.
(In reply to comment #5) > it looks like I'd have to check it on every instruction > I intend to swap, which would really chug (unless I misunderstand your > suggestion). > I was hoping that you'd only need to check the _labels when you wished to swap. Is the issue that we only have a mapping from LIns to NIns and not vice-versa? If so, then I think its fine to add it. > Incidentally, I'm starting to get feedback from TenFourFox users about the > nanojit, which is live in beta 9. On G4, it cuts SunSpider time in half, so it > appears to be very valuable for 74xx. Since most of the microcontrollers, PPC > Amigas and Power Macs out there are G4-derivatives, this implies the nanojit > may be a bigger overall win in the general case. G3 testing is still pending, > but I can't seem to get it any quicker for G5. Sounds like great results.
Yes, I was very happy to get those stats, it meant the work wasn't in vain! I would indeed only need to check _labels when swapping, but virtually every instruction is a candidate for swapping by default so that instruction reordering can occur between LIR opcode code generation. The way I have it written, when a LIR label comes down the pipeline, the swaptimizer is hinted to disable swapping for the next instruction. That way, the "trap" is only sprung when a label is actually there. Is there another way I can trap this?
Are you able to run under Shark on the G5 and get any idea what's going on? I've got some rudimentary code for shark jit symbol generation. It would need some adaptation for TM, and it only works on 10.5 with shark 4.5.0 or earlier... but if your G5 is sufficiently out of date (or if you can dual boot), it should be usable. The repo uses named branches, you want the "shark" branch: $ hg clone -r shark http://hg.mozilla.org/users/edwsmith_adobe.com/profiling
I had Shark up on some representative traces and it seemed to be more a factor of stalls. JSOPs with big-league guard activity seemed the worst (no surprise there). For TenFourFox I'm shipping beta 11 with a G5-tuned jstracer.cpp which, as a starting point, blacklists the JSOPs that I empirically determined to be slow. This is clearly wallpapering the problem but it gets SunSpider within 500ms of interpreter speed, and pushes both Dromaeo and V8 into the win category. Obviously this sort of tuning is out of the scope of this bug. I'll need to scare up a 10.5 PPC, as my iBook and G5 both run 10.4 because of software constraints. I'll take a look at the shark branch as soon as I do.
Comment on attachment 502270 [details] [diff] [review] PowerPC nanojit for Firefox, working (v1) Marking -'ve until follow-up patch is posted for review.
No worries, but what would you suggest for comment 7? I want to be as efficient as possible since all instructions are swappable except those specifically marked immovable.
Seems silly for this bug not be NEW.
now the submitted patch doesn't apply to firefox-4rc2 sources. Anyway how activate nanojit_ppc during a build from source on linux ppc?
Created attachment 520568 [details] [diff] [review] PowerPC nanojit for Firefox, working (v2) (not for review) This is the shipping nanojit in TenFourFox final. It has several minor fixes for stability. I haven't done the other changes yet, so this is just for bookkeeping purposes since I notice some interest in the bug. Rick, would it help if I just took the swap portions out (making EnableSwap() and DisableSwap() no-ops) for now, since that seems to be the part we're stuck on for approval? I could file completing the swaptimizer as a follow-up bug and we could discuss how to do that efficiently there. It will take a bit of a runtime hit without it, but at least we can get that much in the tree. Also, I have done some preliminary work on the extended loadstore architecture for typed arrays, but I'll also file that in a follow-up bug. This should bring PPC almost up to the level of SPARC.
(sorry for bugspam) Nello, this patch is just about getting the PPC nanojit core updated. It does not actually enable it for building.
(In reply to comment #14) > Created attachment 520568 [details] [diff] [review] > PowerPC nanojit for Firefox, working (v2) (not for review) > > Rick, would it help if I just took the swap portions out (making EnableSwap() > and DisableSwap() no-ops) for now, since that seems to be the part we're stuck > on for approval? That's one approach. What I was attempting to convey earlier is that we should try to extract the PPC specific pieces into PPC only files. For example could the ifdefs in clearNInsPtrs() be moved to non-ifdefs in nativePageReset? DECLARE_PLATFORM_ASSEMBLER exists to house just such platform specific calls and might be the right way to go (although I haven't looked in detail at the changes yet). Liksewise can we move InsHistoryEntry and dontSwap out of Assembler? Other than that the rest of the code being PPC specific and being in NativePPC is good. Let me know if you're still stuck with some of this partitioning and we can figure out the best way to go forward.
Those are all no problem. It's the part where I need to force the swap off when a LIR_label comes along that has me stuck. The EMIT macro is as large as it is so that it can peephole-swap between LIRops, but I do not want to swap around instructions if the instruction in question is the destination of a branch. I want to avoid a situation where I have to check on every instruction that's emitted whether it was situated at a LIR_label or not (the current code simply sets a flag as a "fence" whenever it's crossed, which is very efficient). What would be your suggestion for that? I should note that I'm puzzling through an issue where sometimes not enough room is put on the stack by the prologue. This is not fixed yet and is handled in TenFourFox by some strategic placement of trace aborts, out of scope of this bug. I'll deal with that as a followup in the interest of expediency.
I'm shipping a fix for G5 in TenFourFox 4.0.1 that replaces the mcrxr instruction that the LIR overflow math operations use with an equivalent. A big part of the slowdown on G5 simply turned out to be that mcrxr is emulated in software on the 970, which I found while looking up something else in Apple source. Strange that Shark never pointed this out. This probably also affects Cell PPE, Xenon and POWER5/6/7. However, the fix is slower for G3/G4, which have mcrxr in hardware. So, in addition to the labels question, do we have any facility for determining which CPU we're compiling for? TenFourFox has its own switches since we build four separate processor-specific releases. Also, throwing in a couple nops in LIR_d2i also reduces pipeline stalls -- similarly, this is also marginally slower on G3/G4, so they shouldn't get this fix either.
There is a limited amount of CPU sniffing in njconfig.cpp, to detect x86 features. If you wanted to add flags for g4/g5 detection, that would be the place. I don't know of any existing cpu-detection code for PPC, however.
Cameron, any chance we can get this patch up to "please review" status? I would very much like to see PPC Nanojit in SpiderMonkey tar ball releases at some point in the future.
It's the swapping instructions issue I still need some insight from Rick or Edwin on. I don't have any problem moving the custom stuff from Assembler.cpp out; I'm just not sure where to move them to where they can still spy on the LIRops. Failing that, in the short term I can still generate a PoC that doesn't use the swaptimizer; it just won't be as valuable on in-order architectures.
Comment on attachment 520568 [details] [diff] [review] PowerPC nanojit for Firefox, working (v2) (not for review) This this patch still current? Here's how you could move the PPC specific parts out of Assembler.cpp: 1. Move the swap code from clearNInsPtrs() to nativePageReset() in NativePPC.cpp. 2. Make a new asm_label() function implemented by each backend, and move the PPC swap code into NativePPC.cpp. call asm_label() after countlir_label() in case LIR_label in Assembler.cpp. The PPC definitions for InsHistoryEntry, etc, can move to NativePPC.h's DECLARE_PLATFORM_ASSEMBLER macro.
That makes perfect sense. I will start converting it and expanding the patch. One last question: I couldn't find a non-Mac specific way of determining the CPU (on the Mac, we could of course query Gestalt, but that won't work for other POWER systems). Any objection to simply using #ifdefs?
yeah, platform specific ifdefs are okay for that. Take a look of njconfig.cpp for whats been done already for i386 (compiler-specific ifdefs for inline asm) and CodeAlloc.cpp (os-specific ifdefs for i-cache flushing). Do your best to keep the ifdef hair under control, but what ya gotta do, ya gotta do.
Created attachment 541115 [details] [diff] [review] Part 1: add asm_label() to all nanojit backends Taking officially. Let's getting this rolling again. Per Ed's comment 22, let's break this into pieces and start by getting asm_label() into all the current backends. Part 2 will be the actual PPC backend (which is now evolved further to handle expanded loadstore and LIR_cmovd, btw). Rick, are you still okay with reviewing it?
Comment on attachment 541115 [details] [diff] [review] Part 1: add asm_label() to all nanojit backends Looks fine to me so far and sure I can do the reviews.
Created attachment 542714 [details] [diff] [review] Part 1: Add asm_label() to all backends (for check-in) Carrying r+ forward
Created attachment 542715 [details] [diff] [review] Part 2: PPC backend Sorry, Rick, it's marking "me" as having approved my own patch to part 1-- can you fix? Part 2. This is the actual backend, as used and proven in TenFourFox, and has the following changes: - nFragExit works - overflow instructions and branches work - LIR_cmovd and expanded load store - LIR_d2i - CR7 to CR0 for ABI prettiness - branches of arbitrary displacement can now be patched (demoted and promoted) - instructions are analyzed for dependencies and dynamically reordered for better ILP on in-order POWER CPUs (see overgrown EMIT1() macro in NativePPC.h) - optimized codepath for PPC970 (#ifdef _PPC970_); compiles for G3 and G4 (or other CPUs with mcrxr) by default Things it does not have yet: - NANOJIT_64BIT is still not finished, although I did some minor additional work. However, we don't even build a 64-bit TenFourFox; no advantage to us really. - LIR_muli could benefit from a shifter, but I had trouble getting this to work algebraically. - I toyed with making asm_li32() use a 3-way adder, but abandoned it. This passes all tests but bug 576837, but the non-JIT interpreter on PowerPC fails on it too; I will file a separate bug and fix it. On V8 on a G5 2.5GHz quad, js/src/v8/% ../../../obj-ff-dbg/dist/TenFourFox.app/Contents/MacOS/js run.js Richards: 153 DeltaBlue: 202 Crypto: 107 RayTrace: 404 EarleyBoyer: 524 ---- Score: 234 js/src/v8/% ../../../obj-ff-dbg/dist/TenFourFox.app/Contents/MacOS/js -m -j -p run.js Richards: 1753 DeltaBlue: 1276 Crypto: 913 RayTrace: 341 EarleyBoyer: 489 ---- Score: 806 js/src/v8/% perl -e 'print 806/234' 3.44444444444444
Btw, Part 3 will be "not for review," just for informative purposes; more on that when I post it.
Sorry for the delay Cameron, will review this tomorrow, but a quick glance doesn't reveal anything obvious wrong. And nicely done, the results look impressive.
Comment on attachment 542715 [details] [diff] [review] Part 2: PPC backend Overall the changes look fine, but I have not examined each in detail. Marking r+ as I believe this is a solid win for the architectures Cameron reported and that it has no negative impact on tamarin. Also asking Edwin to sr the changes as he wrote this back-end.
Right-o. Thanks, Rick! I'll wait for Ed's approval when he's back.
Just a note: I just noticed line 516 in the new NativePPC.cpp should have a SwapEnable() before it calls asm_branch_far(). This doesn't affect tests, just performance. I'll fix that in the commit-ready patch after Edwin's sr.
review ping :)
Thanks, Ed :) I'll rebase the patch against mozilla-central and comment 33.
changeset: 6642:f25fa9e0eb3d user: Cameron Kaiser <spectre> summary: Bug 624164 - [NPOTB] complete PowerPC nanojit for Firefox (r=rreitmai sr=edwsmith) http://hg.mozilla.org/tamarin-redux/rev/f25fa9e0eb3d
changeset: 6643:072b9c27b990 user: Cameron Kaiser <firstname.lastname@example.org> summary: Bug 624164, part 1: Add asm_label() to all backends. r=rreitmai http://hg.mozilla.org/tamarin-redux/rev/072b9c27b990
Cameron, asm_label() was removed from the Assembler class, so I had to tweak the patch slightly prior to landing. If this introduces a problem, feel free to open a new bug and reference this one.
I was looking at the patch in comment 39 and I didn't see where the PPC's asm_label() code got integrated. However, I'm working on little sleep, so it may be fatigue X-/ This might all wind up being moot for Firefox because of bug 693815, so I'm dithering a bit on what to do.
Yeah it looks like the patch has created some compilation problems (probably due to my editing of it). Virgil sent some patches to fix it up. Opening bug 694641 to house them for now.
Nanojit has been dead for several years. Its Bugzilla component has been moved to the graveyard (bug 984276). I checked all the open bugs. They're all uninteresting, so I'm WONTFIXing them all. Apologies for the bugspam.