We currently emit LIR that writes every stack and global write back. To speed this up, the StackFilter drops redundant stores to the same stack location (modulo instructions like LIR_xbarrier). Writes to globals could be optimized in the same way. The StackFilter also has a writes-over-the-high-water-mark optimization that only makes sense for the stack, so it might make sense to split this into two passes: one that drops writes over the high-water mark (StackFilter) and one that drops redundant writes to the stack and globals (StoreFilter?). I have no idea how much redundant writes to globals are costing us right now; I'm just recording this in a bug.
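For illustration only, here is a minimal sketch of the redundant-store idea; the instruction model (`Ins`, `OpKind`), the function name, and the guard handling are all invented stand-ins, not the real nanojit StackFilter. Walking the instructions in reverse, a store is dead if a later store covers the same location with no intervening guard:

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified instruction model -- not the real nanojit LIR.
enum OpKind { Store, Guard, Other };
struct Ins { OpKind kind; int32_t disp; };  // disp: offset of a store

// Reverse pass: drop a store whose location is overwritten by a later
// store. A guard may observe the stored state, so it conservatively
// clears the set of covered offsets.
std::vector<Ins> filterRedundantStores(const std::vector<Ins>& code) {
    std::vector<Ins> out;
    std::unordered_set<int32_t> covered;    // offsets stored "later"
    for (auto it = code.rbegin(); it != code.rend(); ++it) {
        if (it->kind == Store) {
            if (covered.count(it->disp))
                continue;                   // dead: a later store wins
            covered.insert(it->disp);
        } else if (it->kind == Guard) {
            covered.clear();                // guard may read the state
        }
        out.push_back(*it);
    }
    return {out.rbegin(), out.rend()};      // restore forward order
}
```

The same shape would apply to global stores, with the offset space being whatever index the globals use.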
FWIW, StackFilter has a non-negligible cost. We used to run StackFilter twice, once for SP and once for RP. Bug 515138 merged those two, saving a few ms in SS.
(In reply to comment #1)
> FWIW, StackFilter has a non-negligible cost. We used to run StackFilter twice,
> once for SP and once for RP. Bug 515138 merged those two, saving a few ms in
> SS.

So, was it faster because of the few fewer virtual dispatches? If so, we can just hack it into the single StackFilter. But beyond that, I would suggest, orthogonal to this bug, that we use recursive templates to remove all virtual dispatch (allowing the stages to be inlined). I heard Andreas already tried this and it didn't matter, which is why I didn't think that splitting one pass into two would make any measurable difference. Either way, if we are saving a few writes to memory on every iteration of a big loop, I think it would be well worth the extra record time. What I, or somebody, needs to do is just measure it for, say, our favorite benchmarks.
I don't know if the virtualness of the dispatch is important. Either way, merging the two StackFilters was a win because it removed a stage from the pipeline. I.e., for every LIR instruction it removed a read() function call, plus one or more isStore/isGuard tests, and those tests are highly unpredictable. As for the recursive template idea, it's been tried for both the reader and writer pipelines (I can't remember the bug numbers) and didn't speed things up in either case. And it's horribly inflexible, e.g. it makes adding print stages really difficult. And the code looks awful too. If we really wanted to slim down the reader pipeline, bug 515305 is the way to go -- you won't be able to do better than that; I've effectively manually inlined the entire pipeline, taking advantage of domain-specific knowledge that no compiler has -- but Andreas thought it was too inflexible. And yes, benchmarking will indicate if the runtime benefit is worth the extra compile cost.
(In reply to comment #3)
> As for the recursive template idea, it's been tried for both the reader and
> writer pipelines (I can't remember the bug numbers) and didn't speed things up
> in either case. And it's horribly inflexible, eg. it makes adding print stages
> really difficult. And the code looks awful too.

If done correctly, it would look like this:

#ifdef DEBUG
lir = new VerboseStage1<VerboseStage2<Stage1<Stage2<Stage3<> > > > >(lirbuf);
#else
lir = new Stage1<Stage2<Stage3<> > >(lirbuf);
#endif

Is that so inflexible and awful-looking? I mean, it's not pretty by any stretch of the imagination, but hardly warranting such a diatribe. Inlining would remove the virtual function calls between stages, and we would maintain the logical separation of stages, losing only the ability to do runtime pipeline reconfiguration. True, bug 515305 is able to hand-merge the stages into a single agglomerate, allowing us to avoid redundant unpredictable branches, but that's a separate decision of "do we wish to have a modular pipeline or a big lump o' code", whose answer may very well be "lump o' code is fine, thank you very much".
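For concreteness, here is a compilable toy of the statically-composed pipeline idea. The stage names (`Sink`, `Doubler`, `AddOne`) and the int-passing `ins()` interface are invented stand-ins for the real LirWriter stages; the point is only that calls between stages become ordinary, inlinable member calls rather than virtual dispatch:

```cpp
// Pipeline terminator: just returns the value.
struct Sink {
    int ins(int v) { return v; }
};

// Each stage embeds the next stage as a template parameter, so the
// compiler sees the whole chain and can inline across stage boundaries.
template <class Next>
struct Doubler {                    // example stage: doubles the value
    Next next;
    int ins(int v) { return next.ins(v * 2); }
};

template <class Next>
struct AddOne {                     // example stage: adds one first
    Next next;
    int ins(int v) { return next.ins(v + 1); }
};
```

A statically composed pipeline is then `AddOne<Doubler<Sink> >`, fixed at compile time; swapping stages means editing the type, not reconfiguring objects at runtime.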
Another problem was that lots of code has to be moved into .h files because GCC doesn't have the 'export' keyword. It's also occasionally useful to turn off a phase by hand, and that requires a lot more text editing with the template version. But more generally, for the case I worked on (I can't remember if it was the reader or writer pipeline), the template version was slower, so the aesthetics discussion was moot. I'm no C++ template guru, so perhaps it could be done better; I'd be happy to be proven wrong.
(In reply to comment #4)
> If done correctly, it would look like this:
>
> #ifdef DEBUG
> lir = new VerboseStage1<VerboseStage2<Stage1<Stage2<Stage3<> > > > >(lirbuf);
> #else
> lir = new Stage1<Stage2<Stage3<> > >(lirbuf);
> #endif

Nit on C++ style: even if C++0x fixes this, we're stuck with the " >" -- in which case I advocate that our style guide specify "< " to balance. The above is harder to read because of the asymmetric spacing.

/be
(In reply to comment #5)
> Another problem was that lots of code has to be moved into .h files because GCC
> doesn't have the 'export' keyword.

Yeah, that part isn't fun. To be fair, only EDG has export; apparently it's quite hard to implement.

> It's also occasionally useful to turn off a phase by hand and that requires a
> lot more text editing with the template version.

More than modifying the two instantiation lines I listed above? I'll stop hypothesizing for now and take a shot after some current tasks complete.

(In reply to comment #6)
That makes sense. Should it go in the style guide?
(In reply to comment #7)
> > It's also occasionally useful to turn off a phase by hand and that requires a
> > lot more text editing with the template version.
>
> More than modifying the two instantiation lines I listed above?

Ah, |if (js_LogController.lcbits & LC_TMRecorder) add VerboseWriter| and |if (nanojit::AvmCore::config.soft_float) add SoftFloatFilter|... I see. That doesn't fit so well in a statically-defined pipeline :P.
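To illustrate why flag-dependent stages favor the virtual-dispatch design, here is a hedged sketch; `Writer`, `NegateFilter`, and `buildPipeline` are invented names, not the real LirWriter API, and the sketch deliberately leaks its allocations for brevity:

```cpp
// Hypothetical base writer: each stage holds a pointer to the next and
// forwards through a virtual call, so stages can be chained at runtime.
struct Writer {
    Writer* out;
    explicit Writer(Writer* o) : out(o) {}
    virtual ~Writer() {}
    virtual int ins(int v) { return out ? out->ins(v) : v; }
};

// An optional stage, inserted only when a flag asks for it.
struct NegateFilter : Writer {
    explicit NegateFilter(Writer* o) : Writer(o) {}
    int ins(int v) override { return Writer::ins(-v); }
};

// Pipeline construction can branch on runtime configuration -- exactly
// the part that a statically instantiated template chain can't express
// without enumerating every combination of stages as a distinct type.
Writer* buildPipeline(bool negate) {
    Writer* w = new Writer(nullptr);
    if (negate)
        w = new NegateFilter(w);    // runtime reconfiguration
    return w;
}
```

With the template pipeline, each such flag would multiply the number of concrete pipeline types that have to be instantiated and dispatched to.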
Created attachment 408688 [details] [diff] [review] faster! but only on opposite day

The attached patch is pretty simple; it just lets the StackFilter kill dead stores to the native globals the same way it does for the native stack, and it produces a fairly universal slowdown of 3% on SS. The only speedups were controlflow-recursive (2.2%) and regexp (1.4%). The main source of extra computation is maintaining an extra BitSet (which gets filled and reset on every guard). It makes sense that this would be expensive for globals, since the index space is the nativeGlobalOffset (which can be wide and sparse), as opposed to the fairly dense "distance from the top" index used for sp- and rp-relative stores. A denser index would be the index of a global in the global SlotList, but that is far away from the StackFilter.
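A rough sketch of the cost concern (this `BitSet` is a made-up stand-in, not nanojit's): clearing a bitset costs time proportional to the index space it covers, and it happens on every guard, so a wide sparse index space for globals hurts where the dense stack index does not:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy bitset sized by the index space it must cover.
struct BitSet {
    std::vector<uint64_t> words;
    explicit BitSet(size_t nbits) : words((nbits + 63) / 64, 0) {}
    void set(size_t i)       { words[i / 64] |= uint64_t(1) << (i % 64); }
    bool get(size_t i) const { return (words[i / 64] >> (i % 64)) & 1; }
    // Reset walks every word, so its cost scales with the index space --
    // and in the store-elimination pass it runs on every guard.
    void reset() { std::fill(words.begin(), words.end(), 0); }
};
```

For example, a dense index over a few hundred stack slots needs only a handful of words, while sparse byte offsets into a large global area need many times more, all of which get zeroed per guard.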
Unless someone knows a faster way, resolving WONTFIX.
Created attachment 408723 [details] [diff] [review] this time, without the bug; performance-neutral

Oops! Never mind all that blabber about going slower. It was really bugging me how such a tiny change to recording could have such a large impact, so I traced it down to a silly syntactic bug in the constructor that actually turned off *all* sp-/rp-relative redundant store elimination (which, apparently, is a 2-3% win). With that elimination turned back on, the patch is mostly performance-neutral: 1 ms (2.7%) faster on string-unpack-cloud. So I'll just reopen and leave this here for now...
TM is gone.