Closed Bug 523448 Opened 15 years ago Closed 11 years ago

soup up/split StackFilter to drop redundant writes to globals

Tracking

()

Status:

RESOLVED INVALID

People

(Reporter: luke, Unassigned)

Details

Attachments

(1 file, 1 obsolete file)

faster! but only on opposite day 15 years ago Luke Wagner [:luke] 7.22 KB, patch		Details \| Diff \| Splinter Review
this time, without the bug; performance-neutral 15 years ago Luke Wagner [:luke] 7.23 KB, patch		Details \| Diff \| Splinter Review

Luke Wagner [:luke]

Reporter

Description

•

15 years ago

We currently emit LIR to write back all writes to the stack and globals. To speed this up, the StackFilter drops redundant stores to the same stack location (modulo instructions like LIR_xbarrier).  Writes to globals could be optimized in the same way.  The StackFilter also has a writes-over-the-high-watermark optimization which only makes sense for the stack, so it might make sense to split this into two optimization pieces: one that drops writes-over-the-high-water-mark (StackFilter) and one that drops redundant writes to the stack and globals (StoreFilter?).

I have no idea how much redundant writes to globals are costing us right now; I'm just recording this in a bug.

Nicholas Nethercote [inactive]

Comment 1

•

15 years ago

FWIW, StackFilter has a non-negligible cost.  We used to run StackFilter twice, once for SP and once for RP.  Bug 515138 merged those two, saving a few ms in SS.

Luke Wagner [:luke]

Reporter

Comment 2

•

15 years ago

(In reply to comment #1)
> FWIW, StackFilter has a non-negligible cost.  We used to run StackFilter twice,
> once for SP and once for RP.  Bug 515138 merged those two, saving a few ms in
> SS.

So, was it faster because of the few less virtual dispatches?  If that's the case, then we can just hack it into the single StackFilter.  But more than that, I would suggest, orthogonal to this bug, that we use recursive templates to remove all virtual dispatch (allowing the stages to be inlined).  I heard Andreas already tried this, and it didn't matter, which is why I didn't think that splitting one pass into two would make any measurable difference.

Either way, if we are saving a few writes to memory on every iteration of a big loop, I think it would be well worth the extra record time.  What I, or someobody, needs to do is just measure it for, say, our favorite benchmarks.

Nicholas Nethercote [inactive]

Comment 3

•

15 years ago

I don't know if the virtualness of the dispatch is important.  Either way, merging the two StackFilters was a win because it removed a stage in the pipeline.  Ie. for every LIR instruction it removed a read() function call, and then one or more isStore/isGuard tests.  And those tests are highly unpredictable.

As for the recursive template idea, it's been tried for both the reader and writer pipelines (I can't remember the bug numbers) and didn't speed things up in either case.  And it's horribly inflexible, eg. it makes adding print stages really difficult.  And the code looks awful too.

If we really wanted to slim down the reader pipeline, bug 515305 is the way to go, you won't be able to better than that -- I've effectively manually inlined the entire pipeline, taking advantage of domain-specific knowledge that no compiler has -- but Andreas thought it was too inflexible.

And yes, benchmarking will indicate if the runtime benefit is worth the extra compile cost.

Luke Wagner [:luke]

Reporter

Comment 4

•

15 years ago

(In reply to comment #3)
> As for the recursive template idea, it's been tried for both the reader and
> writer pipelines (I can't remember the bug numbers) and didn't speed things up
> in either case.  And it's horribly inflexible, eg. it makes adding print stages
> really difficult.  And the code looks awful too.

If done correctly, it would look like this:

#ifdef DEBUG
lir = new VerboseStage1<VerboseStage2<Stage1<Stage2<Stage3<> > > > >(lirbuf);
#else
lir = new Stage1<Stage2<Stage3<> > >(lirbuf);
#endif

Is that so inflexible and awful looking?  I mean, its not pretty by any stretch of the imagination, but hardly warranting such diatribe.

Inlining would remove the virtual function calls betweens stages, and we would maintain the logical separation of stages, losing only the ability to do runtime pipeline reconfiguration.  True, bug 515305 is able to hand-merge the stages into a single agglomerate, allowing us to avoid redundant unpredictable branches, but that's a separate decision of "do we wish to have a modular pipeline or a big lump o' code", whose answer may very well be "lump o' code is fine, thank you very much".

Nicholas Nethercote [inactive]

Comment 5

•

15 years ago

Another problem was that lots of code has to be moved into .h files because GCC doesn't have the 'export' keyword.  It's also occasionally useful to turn off a phase by hand and that requires a lot more text editing with the template version.

But more generally, for the case I worked on (I can't remember if it was the reader or writer pipeline) the template version was slower, so the aesthetics discussion was moot.  I'm no C++ template guru so perhaps it could be done better, I'd be happy to be proven wrong.

Brendan Eich [:brendan]

Comment 6

•

15 years ago

(In reply to comment #4)
> If done correctly, it would look like this:
> 
> #ifdef DEBUG
> lir = new VerboseStage1<VerboseStage2<Stage1<Stage2<Stage3<> > > > >(lirbuf);
> #else
> lir = new Stage1<Stage2<Stage3<> > >(lirbuf);
> #endif

Nit on C++ style: so even if 0x fixes this, we're stuck with the " >" -- in which case I advocate that our style guide specify "< " to balance. The above is harder to read because of the asymmetric spacing.

/be

Luke Wagner [:luke]

Reporter

Comment 7

•

15 years ago

(In reply to comment #5)
> Another problem was that lots of code has to be moved into .h files because GCC
> doesn't have the 'export' keyword.

Yeah, that part isn't fun.  To be fair, only EDG has export; apparently its quite hard to implement.

> It's also occasionally useful to turn off a phase by hand and that requires a
> lot more text editing with the template version.

More than modifying the two instantiation lines I listed above?

I'll stop hypothesizing for now and take a shot after some current tasks complete.

(In reply to comment #6)
That makes sense.  Should it go in the style guide?

Luke Wagner [:luke]

Reporter

Comment 8

•

15 years ago

(In reply to comment #7)
> > It's also occasionally useful to turn off a phase by hand and that requires a
> > lot more text editing with the template version.
> 
> More than modifying the two instantiation lines I listed above?

Ah, |if (js_LogController.lcbits & LC_TMRecorder) add VerboseWriter| and |if (nanojit::AvmCore::config.soft_float) add SoftFloatFilter|... I see.  That doesn't fit so well in a statically-defined pipeline :P.

Luke Wagner [:luke]

Reporter

Comment 9

•

15 years ago

Attached patch faster! but only on opposite day (obsolete) — Details — Splinter Review

The attached patch is pretty simple; it just lets the StackFilter kill dead stores to the native globals like the native stack and it produces a fairly universal slowdown of 3% on SS.  The only speedups were controlflow-recursive (2.2%) and regexp (1.4%).

The main source of extra computation is maintaining an extra BitSet (which gets filled and reset on every guard).  It makes sense that this would be expensive for globals since the index space is the nativeGlobalOffset (which can be wide and sparse), as opposed to the fairly dense "distance from the top" index used for sp- and rp-relative stores.  A more dense index would be the index of a global in the gloabl SlotList, but this is far away from the StackFilter.

Luke Wagner [:luke]

Reporter

Comment 10

•

15 years ago

Unless someone knows a faster way, resolving WONTFIX.

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → WONTFIX

Luke Wagner [:luke]

Reporter

Comment 11

•

15 years ago

Attached patch this time, without the bug; performance-neutral — Details — Splinter Review

Oops!  Nevermind all that blabber about going slower.  It was really bugging me how such a tiny change to recording could have such a large impact, so I traced it down to a silly syntactic bug in the constructor that actually turned off *all* sp-/rp-relative redundant store elimination (which, apparently, is a 2-3% win.)  Turning them back on the patch is mostly performance neutral. 1ms (2.7%) faster on string-unpack-cloud.

So I'll just reopen and leave this here for now...

Attachment #408688 - Attachment is obsolete: true

Luke Wagner [:luke]

Reporter

Updated

•

15 years ago

Status: RESOLVED → REOPENED

Resolution: WONTFIX → ---

Jan de Mooij [:jandem]

Comment 12

•

11 years ago

TM is gone.

Status: REOPENED → RESOLVED

Closed: 15 years ago → 11 years ago

Resolution: --- → INVALID

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Quick Search

soup up/split StackFilter to drop redundant writes to globals

Categories

(Core :: JavaScript Engine, defect)

Tracking

()

People

(Reporter: luke, Unassigned)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file, 1 obsolete file)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Updated

Comment 12

Attachment

General

Description

File Name

Content Type