Closed Bug 569321 Opened 14 years ago Closed 14 years ago

VM should optimize rest arguments

Categories

(Tamarin Graveyard :: Baseline JIT (CodegenLIR), defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED
Q3 11 - Serrano

People

(Reporter: lhansen, Assigned: lhansen)

References

Details

(Whiteboard: PACMAN, has-patch)

Attachments

(9 files, 9 obsolete files)

2.48 KB, text/plain
2.47 KB, text/plain
2.50 KB, text/plain
2.50 KB, text/plain
4.02 KB, patch (stejohns: review+, brbaker: review+)
38.63 KB, patch (edwsmith: review+)
1.72 KB, patch
878 bytes, patch
796 bytes, patch (brbaker: review+)
Investigation suggests it's possible for the VM to optimize the use of ...rest arguments so as to avoid allocating the rest arguments array in common cases, by means of a lightweight analysis coupled with a fallback.

Benchmarks suggest that this optimization significantly speeds up functions that use rest arguments, especially if those functions are early-bound (15x for simple cases) but also if they're found in the prototype chain (2x).
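As a concrete illustration of the targeted pattern (my sketch, not code from this bug; the name 'sum' is made up), a function is "benign" when the only operations on the rest array are reading its length and indexing it through the public namespace:

  // Hypothetical example of benign rest-argument usage: 'rest' is only
  // consumed by .length and indexed reads, so the VM can satisfy both
  // from the raw argument area and defer or skip allocating the Array.
  function sum(...rest) {
    var total = 0;
    for (var i = 0; i < rest.length; i++)
      total += rest[i];
    return total;
  }
  print(sum(1, 2, 3));  // 6, using the avm shell's print()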
Attached patch Work in progress (obsolete) — Splinter Review
This sketch is wrong for numerous reasons but provides the overall flavor: the verifier, or perhaps another front-end stage, cheaply recognizes functions whose use of the rest arguments is benign.  LIR generation then emits the new internal restarg/restargc instructions and makes the prologue not allocate the rest array immediately.  The implementations of restarg/restargc normally access the unconsed representation, but if run-time usage turns out to violate the assumptions, the restarg instruction will cons up the array and the new instructions will subsequently access that instead.

The reasons this patch is wrong are at least:
- the analysis in the verifier is too lenient
- the fallback cases in restarg/restargc are not implemented
- the way I emit code in the verifier's case for getproperty is wrong, because the verifier crashes subsequently.
Attached patch Work in progress (obsolete) — Splinter Review
Promoted myself from "novice" to "bumbling amateur" on the JIT.  This works on simple cases; the other objections still stand.
Attachment #448496 - Attachment is obsolete: true
It would be interesting to see the assembly output from the newly-optimized cases.
(In reply to comment #3)
> It would be interesting to see the assembly output from the newly-optimized
> cases.

I'll see what I can do.

The current generated code calls out to C++ helper functions always, but if we could prove that rest[n] is accessed with n always a valid Uint32 value then we could do significantly better: a couple of instructions each for length and element access.  The verifier might have that information for typed code and interesting access patterns (I worry about 'int' vs 'uint'), so that would be an interesting follow-on patch.
This is ready for feedback / preliminary review:

- It passes the test suite on Mac in R and DD modes
- It doubles the performance on the untyped 'push' benchmark (from 27 to 54 iterations, Mac Pro) and improves the performance for calls to early-bound functions that use rest arguments in "nice" ways by a factor of 10 (from 50 to 475 iterations).  The discrepancy in gains is due to the very expensive calls to prototype-bound methods; once we clean that up, this optimization will matter more for JS-style code.
- It requires no user changes to code, nor recompilation - a big win, really.
- It is reasonably clean and surprisingly small.

In particular I would be interested in hearing whether the analysis fits into the verifier or whether it should be factored out, and if so, how.  I put it there because it was expedient.

The following also remain to be done:

- lazyrest_arg has a FIXME, where it needs to construct a complete multiname before accessing an array that has been consed up when the property name turned out not to be "nice".
- I need to have some test cases for the fallback functionality
- lazyrest_arg should probably allow 'double' property names that turn out to be uint32 values
- Since the verifier is pushing OP_restarg and OP_restargc down the coder pipeline I need to decide whether to support those in the wordcode interpreter (and possibly elsewhere? if so, where?) or whether to disable them entirely in non-jit builds.  In principle it would be a real win for the wordcode interpreter to support them but perhaps that's a follow-on bug.
- lazyrest_arg and lazyrest_argc should probably be called restargHelper and restargcHelper, respectively

I will post benchmarks and some generated code subsequently.
Attachment #448556 - Attachment is obsolete: true
Attachment #448735 - Flags: feedback?(stejohns)
Attachment #448735 - Flags: feedback?(edwsmith)
This was derived from the untyped 'push' benchmark.
Compile with -strict.

Performance (Mac Pro):
Without the optimization: 50 iterations/sec
With the optimization: 475 iterations/sec
Directly comparable to the previous benchmark, this represents the best case for the optimization.
Compile with -strict.

Performance (Mac Pro): 892 iterations/sec.

The 2x advantage of this benchmark over the previous one with the optimization enabled shows that there are further gains to be made if we extend the analysis so that we avoid the subroutine calls to the helpers when we know that all property names used to access the rest array are Uint32 values.  In the case of the previous benchmark, the value is a constant - that's probably not terribly common.  But Array.prototype.push uses a uint-typed variable to access the array, and the optimization would apply in that case.
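To illustrate the distinction (my sketch, not the attached benchmark; 'sumTyped' is a made-up name): when the index is statically typed uint, every rest[i] is provably a valid Uint32 property name, so an extended analysis could replace the restargHelper/restargcHelper calls with a couple of inline instructions each:

  // Hypothetical typed variant: n and i are uint, so all accesses are
  // known-valid array indices and the helper calls could be avoided.
  function sumTyped(...rest):Number {
    var total:Number = 0;
    var n:uint = rest.length;
    for (var i:uint = 0; i < n; i++)
      total += rest[i];
    return total;
  }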
The "untyped push benchmark" referenced earlier is in test/performance/jsmicro/array-push-1.as.  Be careful how you run that - normally the performance tests are compiled with -AS3 and so these benchmarks will call AS3 methods, not prototype methods.  The AS3 method is native-coded and does not create the rest array (that's not possible for the prototype method, since it's generic).  Be sure to pass -f --nooptimize to the test runner.
Search for restargHelper and restargcHelper to see where those calls are generated; compare with the original's call to createRestHelper (I haven't included that code here).

Another possible improvement is visible here: right now restargcHelper always returns an atom, but in performance contexts, like Array.prototype.push, it will be converted to int or uint.  It would probably be better for the function to just return uint; that saves time both in the function and in important callers.
Design caution: it's imperative that anything happening in the verifier not cause user-visible changes to the types or nullability of the data model.  As long as any analysis that piggybacks on the verifier's analysis is passive, it's okay.  "Passive" means it can't make the verifier fail, and it can't affect the Value.traits or Value.notNull fields in FrameState.  From what I can tell, this patch honors the contract.

Replacing OBJ.length with restargc is a kind of early binding and we should think of it that way -- you should get faster code with the same type effects.  (Early binding to Array.length should yield uint, for example.)

Could the analysis be done within a CodeWriter?  we have pending work to separate verification from jit-compiling, and it will be easier to tease apart the code if piggyback analysis is well encapsulated.  (but, if the refactoring is awkward, comments are enough).  I do see some annoyances... for example OP_getlocal calls CW.write() instead of writeOp1, forcing analyzers to re-parse abc to get the local var #.

if copies (getlocal/setlocal, etc) propagated the Value.isRestArray bit, would we still need the produceRest/consumeRest accounting?  the accounting is a simple/conservative escape analysis right?  (thinking about how its assumptions might be violate-able)

does produce/consumeRest stay valid in the face of exception handling?  (thinking out loud -- stack contents are erased along the exception edge, so is it enough to zero produceRest/consumeRest at the start of the catch block?  (should produce/consumeRest be fields in FrameState and participate in control-flow merge logic?)  e.g.  OP_getlocal<rest>; OP_jump; ... OP_getproperty<length>

Since restVar is a local variable, it is visible to an active debugger stepping through the code.  debugging should probably just disable the whole optimization.
Attachment #448735 - Flags: feedback?(edwsmith) → feedback+
(In reply to comment #8)
> The "untyped push benchmark" referenced earlier is in
> test/performance/jsmicro/array-push-1.as.  Be careful how you run that -
> normally the performance tests are compiled with -AS3 and so these benchmarks
> will call AS3 methods, not prototype methods.  The AS3 method is native-coded
> and does not create the rest array (that's not possible for the prototype
> method, since it's generic).  Be sure to pass -f --nooptimize to the test
> runner.

You can add a file with asc directives that should allow you to specify that this file should not be optimized. Something like this /should/ do what you want:

file: test/performance/jsmicro/array-push-1.as.asc_args
contents: merge| -no-optimize

example: test\acceptance\regress\vector_domain_bug.as.asc_args
Comment on attachment 448735 [details] [diff] [review]
Functional, performant, nearly complete

nits:

-- disabling the optimization for debugger makes sense, but why disable for the sampler? Trying to profile code that has this as a bottleneck will be misleading...

-- why is NO_VAR a #define rather than a well-typed const?

-- since lazyRest() is only valid post-verification, adding an assert that the MethodInfo has been verified would be nice.

-- instead of multiname.getName()->equalsLatin1("length"), you can use multiname.getName() == core->klength, as getName() will be interned.

-- in lazyrest_argc, we assume that argc will always fit into an intptr atom... is it possible for it to be greater than 2^28? (seems unlikely, but is it *possible*?)

-- meta-question: everywhere we deal with argc, it's a (signed) int, meaning we have to check for < 0... this seems insane, and while it's definitely outside the scope of this bug, it seems nutty that it isn't uint32_t...
Attachment #448735 - Flags: feedback?(stejohns) → feedback+
(In reply to comment #12)
> (From update of attachment 448735 [details] [diff] [review])
> nits:
> 
> -- disabling the optimization for debugger makes sense, but why disable for the
> sampler? Trying to profile code that has this as a bottleneck will be
> misleading...

Sampler != profiler, I believe.  Anyway, the reason I did this is that the sampler 'callback' test (yes, that one again) fails with the optimization enabled, and I assumed without investigating too deeply that the sampler was grubbing around in state on the C++ level.  I should probably dig a little deeper.

> -- why is NO_VAR a #define rather than a well-typed const?

WILLFIX.

> -- since lazyRest() is only valid post-verification, adding an assert that the
> MethodInfo has been verified would be nice.

Good idea.

> -- instead of multiname.getName()->equalsLatin1("length"), you can use
> multiname.getName() == core->klength, as getName() will be interned.

Thank you.

> -- in lazyrest_argc, we assume that argc will always fit into an intptr atom...
> is it possible for it to be greater than 2^28? (seems unlikely, but is it
> *possible*?)

In the vicinity of 2^30 array elements is probably possible on 32-bit systems, but the maximum value we will have here is tied to the maximum number of arguments that can be passed to a function, and I assumed - again without digging - that that is much lower.  Will investigate if nobody provides an answer.

> -- meta-question: everywhere we deal with argc, it's a (signed) int, meaning we
> have to check for < 0... this seems insane, and while it's definitely outside
> the scope of this bug, it seems nutty that it isn't uint32_t...

Yes, I think the right thing to do is to morph the argc variable to AS3 'uint'.
(In reply to comment #10)
> Design caution: it's imperative that anything happening in the verifier not
> cause user-visible changes to the types or nullability of the data model.  As
> long as any analysis that piggybacks on the verifier's analysis is passive,
> it's okay.  "Passive" means it can't make the verifier fail, and it can't
> affect the Value.traits or Value.notNull fields in FrameState.  From what I
> can tell, this patch honors the contract.

"If you lie to the computer, it will get you".  Yes, this has been on my mind and was the root cause of many of my headaches along the way, and in the current design I'm reasonably happy.  I still lie a little bit - the restargs local is marked not-null even if it is 0, the reasoning being that that is not observable to the program, and it ensures that pass1 and pass2 do not diverge.

> Replacing OBJ.length with restargc is a kind of early binding and we should
> think of it that way -- you should get faster code with the same type effects.
> (Early binding to Array.length should yield uint, for example.)

At the moment I model this by returning Atom from restargcHelper and then coercing to whatever type the context expects; is this kosher?

> Could the analysis be done within a CodeWriter?  we have pending work to
> separate verification from jit-compiling, and it will be easier to tease apart
> the code if piggyback analysis is well encapsulated.  (but, if the refactoring
> is awkward, comments are enough).  I do see some annoyances... for example
> OP_getlocal calls CW.write() instead of writeOp1, forcing analyzers to re-parse
> abc to get the local var #.

No opinion on this, I'll probably be inclined to try to land as-is with comments, but I have no good view of the long-term ramifications.

> if copies (getlocal/setlocal, etc) propagated the Value.isRestArray bit, would
> we still need the produceRest/consumeRest accounting?  the accounting is a
> simple/conservative escape analysis right?  (thinking about how its assumptions
> might be violate-able)
> 
> does produce/consumeRest stay valid in the face of exception handling? 
> (thinking out loud -- stack contents are erased along the exception edge, so is
> it enough to zero produceRest/consumeRest at the start of the catch block? 
> (should produce/consumeRest be fields in FrameState and participate in
> control-flow merge logic?)  e.g.  OP_getlocal<rest>; OP_jump; ...
> OP_getproperty<length>

It's a speculative escape analysis and I'm not yet completely convinced of its correctness.  The produceRest/consumeRest values are a short-hand so that I did not have to add code to every instruction case, but I may have missed some cases: I'm not sure what the effect of 'dup' is, for example.

Right now the analysis requires that if the rest array is produced in a block it is consumed in that block, one consumer for each producer.  (So 'dup' probably needs to conditionally increment produceRest, or bail: set restVar to NO_VAR.)  Going beyond basic blocks is attractive but not necessary for the basic case, imo, which is to make it possible to use rest args in stylized ways without paying a penalty.  And that being the case, an exception shouldn't be a problem: it prunes the stack (throwing away any unconsumed implicit 'rest') and enters a block at its beginning, and that block will always produce a rest array before consuming it.

> Since restVar is a local variable, it is visible to an active debugger stepping
> through the code.  debugging should probably just disable the whole
> optimization.

The four lines following the initialization of restVar do just that.

I will follow up tomorrow re the analysis, with some propositions about its correctness.
(In reply to comment #11)
> 
> You can add a file with asc directives that should allow you to specify that
> this file should not be optimized.

Is it possible to have one such file per directory, rather than one per test file?  I can automate the creation of those files, but it just turns into such a mass of files.
Following is the algorithm for the least ambitious but still useful analysis.  Please check it to see if you believe it.  I know it can be extended, but I would much prefer to start with something simple and useful than to try to create something general and much harder to implement.

-----

We aim to determine whether the 'rest array' is used in only two ways:

 - to access the 'length' property with either the public namespace or a public namespace in the namespace set

 - to access an unknown property with either the public namespace or a public namespace in the namespace set

If those are the only uses of the rest array then the rest array is eligible for a speculative optimization that delays or avoids its construction.  (The optimization can be non-speculative if more is known about the property names in the second case but I won't address that further.  In important use cases we won't have that information.)

To determine whether we can enable the optimization we analyze the ABC code for the method; this amounts to a simple escape analysis.  We say the analysis "fails" if the optimization is rejected because we can't prove that the rest array does not escape.

Nested methods cause the array to escape; the analysis fails if the code requires an activation record.  (It is possible to do better.)

A debugger causes the array to escape; the analysis fails if a debugger is present.

The rest array is stored in a distinguished local slot in the ABC code, and we know which slot this is.  We assume ASC does not do something clever with the array, eg, by moving it to a different local slot.  (It is possible to do better.)

The analysis needs to prove the following:

* If the rest local is read by a GETLOCAL and the rest array value is produced, and if that value is consumed by an instruction (rather  than being discarded by an exception) then that value must be consumed by a GETPROPERTY instruction of one of two forms:

 - It reads a "length" property with a public namespace or a public namespace in its namespace set.

 - It reads a property with a run-time name, with a public namespace or a public namespace in its namespace set.

* If the rest local is read by any other instruction than a GETLOCAL, or is consumed or inspected by any other instruction than a GETPROPERTY of the two forms above, then the analysis fails.

The analysis is fairly weak but handles code that occurs in practice.

A bit, isRestArray, on each Value in the FrameState tracks whether a value is a rest array.  This bit is set by GETLOCAL.  The bit is checked by every other instruction: the operands to the instruction are checked.  If the bit is set but the instruction is not a GETPROPERTY of the correct form then the analysis fails.  (This means the bit does not propagate with the value as the value is duplicated, moved into locals, etc.)

The analysis checks the values that remain in the FrameState at block boundaries.  If isRestArray is set on any value, the analysis fails.  (It is possible to do better but not necessary to do so at this point.)
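Two hypothetical functions illustrating the boundary (my examples, not from the test suite):

  // Analysis succeeds: every GETLOCAL of 'rest' is consumed by a
  // GETPROPERTY of an allowed form - public 'length' or a public
  // run-time name.
  function ok(...rest):int {
    return rest.length > 0 ? rest[0] : -1;
  }

  // Analysis fails: a GETLOCAL of 'rest' is consumed by a call
  // instruction, so the array escapes and must be consed as usual.
  function leak(...rest):void {
    inspect(rest);
  }
  function inspect(a:Array):void {
    print(a.length);  // avm shell's print()
  }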
(In reply to comment #15)
> Is it possible to have one such file per directory, rather than one per test
> file?  I can automate the creation of those files, but it just turns into such
> a mass of files.

Yes you can have a dir.asc_args file and it will be used
I can't find any flaws in the algorithm relative to its aim, which is to identify methods that only use the rest array in those two ways.  However, the second way amounts to an escape, which the runtime must handle by lazily creating the array (right?)

If lazy creation of the array happens after some previously early-bound code is emitted, the code could be incorrect.  Can you have a looping situation where the analysis succeeds, allowing early binding to rest.length, then the runtime creates the array on the fly due to escaping, and then on a second iteration the early-bound length is wrong?

sketch:
  function test(name, ...rest) {
    for (var i=0; i < 2; i++) {
      getlocal #rest
      getproperty public::length  // early binds to argc
      getlocal #rest
      getlocal #name
      getproperty {public,AS3}::[] // does not cause analysis to fail
      pushint 1
      call     // invokes this.AS3::push(1) which modifies length
    }
  }
  test("push")
(In reply to comment #18)
> I can't find any flaws in the algorithm relative to its aim, which is to
> identify methods that only use the rest array in those two ways.  However,
> the second way amounts to an escape, which the runtime must handle by lazily
> creating the array (right?)

Right.

> If lazy creating the array happens after some previously early bound code is
> emitted, the code could be incorrect.  Can you have a looping situation where
> the analysis succeeds, allowing early binding to rest.length, then runtime
> creates the array on the fly due to escaping, and then on a second iteration,
> the early bound length is wrong?

You can have a loop, but the bug here would be in the proposed implementation of early binding to 'length'.  Every early-bound access to 'length', and every property access on the object, must check whether the rest array has been consed up, and if so access the array rather than the cached information.  I.e., in the current implementation the code would be

  function test(name, ...rest) {
    for (var i=0; i < 2; i++) {
      getlocal #rest
      restargc
      getlocal #rest
      getlocal #name
      restarg {public,AS3}::[] // does not cause analysis to fail
      pushint 1
      call     // invokes this.AS3::push(1) which modifies length
    }
  }
  test("push")

where restargc and restarg are pretty much as in the provided patch and dispatch on whether the rest array exists or not.
I buy that.  I didn't look at the code but I believe the implementation of restargc will be right as long as it handles lazy creation.

For rest to escape via a nested function it would have to be copied to the scope chain via OP_getlocal+OP_pushscope, which would cause the analysis to fail.  (Not that we're trying to make the analysis more general, just noting.)

I think what the analysis is really buying us is avoiding aliasing - no copies of the rest array will exist statically in the frame of the function, if the analysis passes.
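A hypothetical illustration of that escape route (my sketch, made-up names): capturing the rest parameter in a nested function requires an activation record, which already makes the analysis fail under the algorithm above, before the OP_getlocal+OP_pushscope copy is even considered:

  // The closure captures 'rest', so the method needs an activation
  // record; the analysis fails and the array is consed eagerly.
  function makeGetter(...rest):Function {
    return function(i:uint):* { return rest[i]; };
  }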
(In reply to comment #14)
> > Could the analysis be done within a CodeWriter?  we have pending work to
> > separate verification from jit-compiling, and it will be easier to tease apart
> > the code if piggyback analysis is well encapsulated.  (but, if the refactoring
> > is awkward, comments are enough).  I do see some annoyances... for example
> > OP_getlocal calls CW.write() instead of writeOp1, forcing analyzers to re-parse
> > abc to get the local var #.
> 
> No opinion on this, I'll probably be inclined to try to land as-is with
> comments, but I have no good view of the long-term ramifications.

Just to comment on this some more before I offer the patch for review, I've restructured the analysis to use another CodeWriter, because this (a) removes the hacks with the counters and makes it more credible that the analysis is correct and (b) it makes it possible to do more sophisticated analysis (which I have not implemented yet but am pondering).

However, the current structure of the Verifier and CodeWriter is not friendly here.  The problem is that the Verifier is not simply a verifier; it also performs analysis and rewriting of its own.  In particular, it may decide by its own analysis that a particular property access is early-bindable or whatever, and thus rewrite the code in some fashion.  At that point it may have preempted any optimization in a following CodeWriter which would have simply removed or otherwise significantly optimized the code, as I'm trying to do here.  On the other hand, the latter CodeWriter could not come before the Verifier without having to significantly duplicate the Verifier's effort, and even then it could not insert internal-only instructions in the instruction stream.

So the way I've had to do it is to rely on capturing intra-instruction behavior in write, writeOp1, etc, and post-instruction behavior in writeOpcodeVerified, and /then/ I've had to insert code inside the verifier to handle cases where the verifier's rewriting needs to be overridden (for OP_getproperty).

Based on a fairly naive understanding of things it seems to me that the Verifier should not be rewriting code at all; it should be verifying that the ABC is correct.  Rewriting should be left to later pipeline stages, and that would allow optimization stages to be inserted into the pipeline without having to change the verifier.

(Not a big deal, just observing.)
Attachment #449849 - Attachment description: Benchmark: class with early-bindable → Benchmark: class with early-bindable 'push' that uses varargs and for which the analysis succeeds, but where there's a fallback to the general case at run-time.
(In reply to comment #23)
> Created an attachment (id=449849) [details]
> Benchmark: class with early-bindable

Should have read "Benchmark: class with early-bindable 'push' that uses varargs and for which the analysis succeeds, but where there's a fallback to the general case at run-time."

A variant on this benchmark is to use an index value that is an integer but out of range, eg "1"; the performance profile is rather different (31 iterations/sec rather than 6 iterations/sec).

Generally what these benchmarks show is that if the analysis succeeds, but there's a fallback to allocating the array at run-time, then there's a regression in performance relative to always consing up the varargs in the first place.  Some of that overhead is inherent in calling the helper methods (we see a slowdown from the onearg case to the optimized-varargs case too), some probably comes from suboptimal handling in restargHelper.  It does not seem important, but it could be improved upon.
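To make the fallback concrete (my example, not one of the attached benchmarks): the analysis can succeed statically while a run-time property name still forces the helper to cons up the array:

  // 'rest' is only consumed by a run-time-name GETPROPERTY, so the
  // analysis succeeds; but a name that is not a valid Uint32 index
  // cannot be served from the raw arguments, and restargHelper falls
  // back to creating the Array lazily.
  function probe(name:*, ...rest):* {
    return rest[name];
  }
  probe(0, "a", "b");      // fast path: valid index
  probe("foo", "a", "b");  // fallback: array consed at run-time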
Attached patch Patch (obsolete) — Splinter Review
Addresses all concerns expressed previously.  Also see previous comments I've made, and benchmarks.

Follow-on work items could be:
 - a stronger escape analysis (less brittle in the face of ASC whims)
 - analysis that makes use of type and range information to avoid calling the helpers
 - more fast paths in restargHelper
 - (your idea here)
Attachment #448735 - Attachment is obsolete: true
Attachment #448749 - Attachment is obsolete: true
Attachment #449852 - Flags: review?(stejohns)
Attachment #449852 - Flags: review?(edwsmith)
Whiteboard: PACMAN → PACMAN, has-patch
One more observation: the main reason the patch is so big is that it has to decode instructions to determine the number of stack slots read.  If we had a table with this information the patch would have been a lot smaller.  The normal attribute table does not have that information; the wordcode interpreter does, however, and Chris Brichford suggested merging all these tables together (because of the AOT work and general utility), but bug #545254 has been languishing since late February.
(In reply to comment #21)
> (In reply to comment #14)

> Based on a fairly naive understanding of things it seems to me that the
> Verifier should not be rewriting code at all; it should be verifying that the
> ABC is correct.  Rewriting should be left to later pipeline stages, and that
> would allow optimization stages to be inserted into the pipeline without
> having to change the verifier.
> 
> (Not a big deal, just observing.)

Completely agree, and that's also the subject of this cluster of bugs:

https://bugzilla.mozilla.org/showdependencytree.cgi?id=540487&hide_resolved=1
Attachment #449849 - Attachment mime type: application/applefile → text/plain
FYI, this item from earlier is still unresolved:

"- Since the verifier is pushing OP_restarg and OP_restargc down the coder
pipeline I need to decide whether to support those in the wordcode interpreter
(and possibly elsewhere? if so, where?) or whether to disable them entirely in
non-jit builds.  In principle it would be a real win for the wordcode
interpreter to support them but perhaps that's a follow-on bug."

The most likely fix is to make the optimization jit-only for now and open a bug to track the possibility of implementing the optimization for the wordcode interpreter.  It's an easy fix - disable the optimization in the RestArgAnalyzer's constructor if it's a wordcode build.

Also pending is rebasing to TR tip and a compile fix for Windows (unused parameters).
(In reply to comment #28)
> non-jit builds.  In principle it would be a real win for the wordcode
> interpreter to support them but perhaps that's a follow-on bug."

Brings up a meta-question, namely: what's the future of the wordcode interpreter? It was created to improve performance on JIT-prohibited systems, but at present there are only two of those on our radar (iOS and PS3), of which the former is obviously a nonfactor for us... are we likely to revive wordcode in a shipping system anytime in the near-term future? Is it worth spending time maintaining?
Let's take the discussion about wordcode futures to email.
Attached patch Patch, v2 (obsolete) — Splinter Review
Virtually the same as the previous patch, but disables the optimization for wordcode and non-jit compiles (in the RestArgAnalyzer constructor) and fixes a Windows compile problem by removing two unused parameters on RestArgAnalyzer::getProperty.  Also rebased to tr tip.
Attachment #449852 - Attachment is obsolete: true
Attachment #449894 - Flags: review?(stejohns)
Attachment #449894 - Flags: review?(edwsmith)
Attachment #449852 - Flags: review?(stejohns)
Attachment #449852 - Flags: review?(edwsmith)
Comment on attachment 449894 [details] [diff] [review]
Patch, v2

My only concern is that this patch seems somewhat fragile, in that the necessary bits are (of necessity) strewn in various places in Verifier... probably unavoidable, and maybe I'm overreacting. Still, can we make specific acceptance tests for this to ensure it doesn't accidentally break in the future due to unforeseen Verifier tweaks?
Attachment #449894 - Flags: review?(stejohns) → review+
Comment on attachment 449894 [details] [diff] [review]
Patch, v2

Even though the patch is bigger, it's nice to get the encapsulation, and the code should shrink when the opcode tables land and whenever we better isolate verification.

The only hazard I can see is that when a verify exception is thrown, restAnalyzer.cleanup() won't be called.  If you fix it by creating a field in Verifier for restAnalyzer, then ~Verifier can clean up, and (optionally) verifyBlock() doesn't need the extra argument.  Or did I miss something?  -- This is an example of a locally allocated CodeWriter, of the kind the comment near the top of Verifier::verify() warns about.

R+ once cleanup is fixed/assured.
Attachment #449894 - Flags: review?(edwsmith) → review+
(In reply to comment #32)
> (From update of attachment 449894 [details] [diff] [review])
> My only concern is that this patch seems somewhat fragile, in that the
> necessary bits are (of necessity) strewn in various places in Verifier...
> probably unavoidable, and maybe I'm overreacting. Still, can we make specific
> acceptance tests for this to ensure it doesn't accidentally break in the future
> due to unforeseen Verifier tweaks?

I think it will be possible to construct a simple benchmark that would reliably show a large gain when the optimization kicks in.  I would drop this into the acceptance tests.  Separate patch coming up for that.
(In reply to comment #32)
> (From update of attachment 449894 [details] [diff] [review])
> My only concern is that this patch seems somewhat fragile, in that the
> necessary bits are (of necessity) strewn in various places in Verifier...

And indeed bugs remain.  I read verifyBlock superficially and assumed it would return at the end of a block, but this is not so: it only returns if the block ends unconditionally, by JUMP, LOOKUPSWITCH, RETURN, or THROW - not on conditional jumps.  Yet the soundness of the analysis is predicated on no unconsumed rest arguments remaining on the stack at any jump (because states are not merged).

The correct thing to do is to move the calls to endBlock out of the verifier main loop and into the analysis itself, in writeOpcodeVerified most likely.
Attached patch Patch, v3 (obsolete) — Splinter Review
One final(?) round.  Changes here relative to the previous version are:

- The definition of RestArgAnalyzer has moved into Verifier.h
- RestArgAnalyzer is a field on Verifier, with two-stage setup (trivial constructor + init)
- RestArgAnalyzer::cleanup is gone
- The 'analyzer' arg on verifyBlock is gone
- The hooking of restArgAnalyzer into the pipeline has been abstracted in RestArgAnalyzer::hookup
- The calls to endBlock have been moved into restArgAnalyzer::writeOpcodeVerified
- The assert in NullWriter::NullWriter has been removed because (a) it's in the way and (b) if it's a condition for this field to be not-null then it should also be const.  LMK if this change is a hardship.
- In RestArgAnalyzer, optimize is set to false during analysis by calling a new method 'fail'; this aids debugging.

In other words, cleanup should now function correctly and the end-of-block condition on the analysis phase should be adhered to properly.
Attachment #449894 - Attachment is obsolete: true
Attachment #450079 - Flags: review?(edwsmith)
Attached patch Acceptance testSplinter Review
Measures two programs: one where the optimization must kick in, one where it won't.  The ratio of the v8 score of the former to the latter must be at least 5.  I will post a follow-up change to testconfig.txt that disables this test for debugger builds, since the debugger disables the optimization.

Though this test measures performance I put it in the acceptance tests (it goes into as3/Definitions/Function/RestOptimization.as).  Feel free to argue, but I felt that this is not primarily a performance test, but a test that the VM does what it's supposed to do with a specific type of program.
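The shape of the test is roughly as follows (my sketch of the idea only, not the actual RestOptimization.as; it assumes the avm shell's getTimer() and print()):

  // 'fast' is eligible for the optimization; 'slow' defeats it by
  // letting the rest array escape into an ordinary call.
  function fast(...rest):uint { return rest.length; }
  function consume(a:Array):uint { return a.length; }
  function slow(...rest):uint { return consume(rest); }

  // Score each form by iterations completed in a fixed interval and
  // require the optimized form to win by at least a factor of 5.
  function score(f:Function):uint {
    var n:uint = 0;
    var t0:int = getTimer();
    while (getTimer() - t0 < 1000) { f(1, 2, 3); n++; }
    return n;
  }
  print(score(fast) >= 5 * score(slow) ? "PASSED" : "FAILED");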
Attachment #450087 - Flags: review?(stejohns)
Attachment #450087 - Flags: review?(brbaker)
Attached patch Testconfig patch (obsolete) — Splinter Review
Attachment #450088 - Flags: review?(brbaker)
Blocks: 570944
The only argument I can think of in favor of the test being a performance test would be if it is not reliable... sampler acceptance tests are notorious this way.  If it runs in a reasonable time (all platforms/configs) and isn't flaky then it's fine in acceptance tests.  IIRC, the spidermonkey regression tests have at least one test that tries to ensure the big-O performance of Array.sort is O(N log N).  This new one is the same idea.
(In reply to comment #36)
> - The assert in NullWriter::NullWriter has been removed because (a) it's in the
> way and (b) if it's a condition for this field to be not-null then it should
> also be const.  LMK if this change is a hardship.

Presumably you still utilize some default methods of NullWriter, and coder becomes non-null at lazy-init time?  Cool; otherwise extending CodeWriter would be fine.  No hardship.  The assert would only mask an immediate NPE in broken code anyway - limited value.
Attachment #450079 - Flags: review?(edwsmith) → review+
Comment on attachment 450079 [details] [diff] [review]
Patch, v3

I'm sorry I didn't review the jit code more carefully earlier; I was obsessed with the analyze phase.

Curious, why the goto in restargcHelper() instead of:

   if (*restLocal == 0)
      return argc;
   return /* call get_length */

keeping style similar to restargHelper()?

Passing restLocal by reference to restargcHelper() means that a call to restargcHelper() must be modelled by dead-variable removal as a "read"; without that modelling, stores to that variable might be found to be dead and incorrectly removed.  (Odds are, there are other reads, but with loops and wonky control-flow paths this is not an area to count on luck.)

This modelling is done in analyze_call().  An example of another function that passes vars by reference and is modelled as a 'read' is makeatom(), at the bottom of analyze_call().

An alternative for restargcHelper() is to pass the restLocal variable by value, loaded with "LIR_ldp" (pointer-sized value).  Then deadvars analysis will see the load and get the modelling right based on that.

restargHelper() both reads and writes restLocal, so it should be modelled as such as well, which is a little trickier.  The only precedent like this is ordinary calls in debuggable code - they're modelled as a read and write of every local.  We need to model the read from restLocal in analyze_call(), and the [possible] write needs to be modelled in VarTracker::insCall() by clearing the remembered value in vars[].  (Set vars[param_count] = 0 if we see a call to restargHelper().)

In CodegenLIR::writeEpilogue(), add

    if (restArgc) livep(restArgc)

to keep the stack slot live for the whole method; this is required when the method contains a loop and the last ordinary use is inside the loop (we must extend the lifetime of the slot to the end of the loop).
(On the "wishlist" is a debug-only analyzer that spots this category of mistake... we don't do live-range analysis automatically because it's too slow and is only needed for a handful of jit-generated values.)

In writing this, it's a refreshing reminder that the jit is too fragile.  But it is what it is.
A corollary to my above comment is that any modelling/mutation of vars[] should be accompanied by corresponding modelling/mutation of tags[].  I racked my brain on this a bit -- I think restarg optimization won't happen in any situation where tags[] has a meaningful value... but again, that's a fragile assumption.

A scenario where tags[restLocal] is read:

  if ... 
    restLocal = OP_newarray
  else
    restLocal = /* something else */
  print(restLocal[1])

The reference in the control-flow-join block causes tags[] to be read if the two assignments aren't the same type (SST_object vs SST_atom, for example).  I think the analysis would reject this code though, so just noting.

During verify phase 1, this situation can be detected when FrameState.value(restLocal).sst_mask contains more than one bit set.

Two cheap insurance policies would be:
   1. detect two bits set in sst_mask, and fail() in that case
   2. synchronize modelling of the restarg helpers to both vars[restLocal] and tags[restLocal] in the jit.

1 is cheaper than 2, probably.
Following up to the last four comments:

We get an order of magnitude improvement with the optimization in release mode and much more in debug mode so I expect a factor of five cutoff to be safe.  (Famous last words.)  I need to update the testconfig patch however because the test will fail in the interpreter.

Piggy-backing on NullWriter is convenient; I only implement five functions of CodeWriter.

Fixing restargcHelper to take the array value (or NULL) by-value is clearly the right thing.

I will model reads and writes for restargHelper as this seems like the right thing to do.

As discussed on IM, restLocal is written once and its type is set to ARRAY_TYPE.  The value is either a real array or 0.  If the value is 0, jitted code reads the value to pass it to restargcHelper, but never writes the value.  Thus I'm pretty sure I don't need to worry about consistency in the sense discussed in comment #42.

New patch coming.
Attached patch Patch, v4Splinter Review
(And once more, with feeling.)

Made the fixes outlined in my previous comment, no other changes.
Attachment #450079 - Attachment is obsolete: true
Attachment #450174 - Attachment is patch: true
Attachment #450174 - Attachment mime type: application/octet-stream → text/plain
Attachment #450174 - Flags: review?(edwsmith)
Attached patch Testconfig patch, v2 (obsolete) — Splinter Review
Also disable for interpreter runs
Attachment #450088 - Attachment is obsolete: true
Attachment #450176 - Flags: review?(brbaker)
Attachment #450088 - Flags: review?(brbaker)
Comment on attachment 450174 [details] [diff] [review]
Patch, v4

substance:

As discussed on IM, CodegenLIR::restArgc is defined in the prolog and only read elsewhere, so it doesn't need to be allocated storage; it's just an SSA expression.

ship it!

nits:

jit-calls.h: 864:  if (restLocal == 0) // Array has been created lazily

missing "not" in the comment?

jit-calls.h: 913: could simplify toplevel->toVTable(obj) to (*restLocal)->vtable, if desired.  We know obj is a non-null ScriptObject.  Since this is the slow path, whatever, but it avoids a call and a switch before calling getproperty (which is a monster).

* restLocal, firstLocal, and param_count()+1 are used in various places in CodegenLIR.cpp; this should probably be a field in CodegenLIR, calculated once in writeProlog (but we'd still have to pass it around a couple of places, like to VarTracker).
Attachment #450174 - Flags: review?(edwsmith) → review+
Component: Virtual Machine → JIT Compiler (NanoJIT)
QA Contact: vm → nanojit
Blocks: 420368
Blocks: 477139
Attachment #450087 - Flags: review?(stejohns) → review+
Fixed restArgc issue.
Fixed jit-calls.h nits.
Introduced a restLocal property on CodegenLIR; that clears up many remaining nits.

tamarin-redux changeset:4790:293e97059b2a, including test case and testconfig changes (awaiting Brent's review).
Blocks: 571468
Blocks: 571469
Note, a fix for 64-bit systems was posted in 4791 (LIR_livep should be LIR_livei) and then the optimization was disabled in 4792 while we investigate problems on PPC, ARM, and MIPS.
Remember to re-enable the acceptance test when re-enabling the optimization.
This will land once it's been tested in the sandbox.
Two more fixes noticed:

Toplevel* isn't constant with respect to code; loadEnvToplevel() produces a possibly hoisted Toplevel*.

Must convert index argument to Atom before calling helper (opportunity to specialize the helper for int/uint/double in the future, to avoid native->atom conversion).
Comment on attachment 450176 [details] [diff] [review]
Testconfig patch, v2

There is an acceptance run in the deep phase of the build that runs a debugger build with the -Dnodebugger switch. This config /should/ optimize the restargs and therefore should be able to properly run the testcase. The config string for those runs looks like .*debugger-Dnodebugger
Attachment #450176 - Flags: review?(brbaker) → review-
Attachment #450087 - Flags: review?(brbaker) → review+
Pushed my bug fix, Ed's bug fix, reenabled the optimization, but left the acceptance test disabled until I can properly debug the test selection:

tamarin-redux changeset:   4812:1c8482dc0b74
(A background paper on design alternatives etc for this optimization was added to the "Papers, Demos, etc" section on the ActionScript Engineering home page on zerowing.)
Trying again.
Attachment #450176 - Attachment is obsolete: true
Attachment #452263 - Flags: review?(brbaker)
Attachment #452263 - Flags: review?(brbaker) → review+
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Flags: flashplayer-bug+