local var ref slower than global, potentially due to coerce_a



Virtual Machine
7 years ago
7 years ago


(Reporter: pnkfelix, Unassigned)


(Blocks: 1 bug)



(3 attachments)

Created attachment 575036 [details]
AS microbench F.as: 1shot with global (fast)

Performance issue that Felix noticed today

For some informal background, scroll down to the 08:56 timestamp here:


Here's the summary: sometimes, in a nested for loop of the form:

    for (i=0; i < icount; i++) {
        for (j=0; j < jcount; j++) {

There are inner references (reads and writes) to 'j'; those references could be to a global variable 'j' declared at the toplevel, or to a local variable declared within the function.

Felix would expect the local declaration of 'j' to be at least as fast as the global and probably faster.  But in many (admittedly not all, but many) of his tests, he found it to be significantly slower.

At the moment, from examining the generated byte code and jit'ted code, Felix's best guess is that the slow-down in the case of the local-declared 'j' is due to the introduction of a coerce_a in the inner loop that is not present when 'j' is globally declared.

In other words, here are the abcdumps of the inner loops of both the fast version and the slow version:

Globally declared 'j', Seemingly fast:
  39        label         	
  40        getglobalscope	
  41        getslot       	1
  43        increment     	
  44        setlocal      	5
  46        getlocal      	5
  48        getglobalscope	
  49        swap          	
  50        setslot       	1
  52        kill          	5
  54        getglobalscope	
  55        getslot       	1
  57        pushshort     	1000
  60        iflt          	L4

Locally declared 'j', Seemingly slow:
  42        label         	
  43        getlocal      	4
  45        increment     	
  46        coerce_a      	
  47        setlocal      	4

  49        getlocal      	4
  51        pushshort     	1000
  54        iflt          	L4

Felix suspects the coerce_a is expensive because it expands into a callout to the doubleToAtom_sse2 function; skimming the jit-generated code for the Globally declared 'j' does not seem to contain any analogous call-outs.

This problem is very tricky because it appears that the presence/absence of the coerce_a is itself subject to the whims of asc.jar.  Relatively small changes to Felix's test (e.g. refactoring the code so that the inner loop is in its own function) seem like they cause the coerce_a to be replaced in the bytecode by a convert_d (which then executes fast and thus removes the regression).
Created attachment 575037 [details]
AS microbench G.as: 1shot with local (slow)
Created attachment 575042 [details]
AS microbench S.as: named with local (fast; no coerce_a)

This file is a variant on G.as which does not exhibit the regression and serves as an illustration of the trickiness mentioned at the end of comment 0.  In particular, the "only" diffrence between the G.as and S.as is that in G.as, the function definition appears in an expression context (and thus is itself an expression) and is immediately invoked.

In S.as, the function appears in statement context [*], and is invoked in a subsequent statement.

Despite the relatively minor differences in the two source files (at least at the lexical level if not the grammatical or semantic level), the generated bytecode is significantly different: the inner loop for S.as looks like:

  39        label         	
  40        getlocal      	4
  42        increment     	
  43        convert_d     	
  44        setlocal      	4
  46        getlocal      	4
  48        pushshort     	1000
  51        iflt          	L4

(note that there is no coerce_a; instead the convert_d opcode appears after the increment.)

And as noted at the outset, S.as does not exhibit the performance regression described in comment 0.  (I plan to post concrete numbers in comment 3.)

[*] Terminology Question: Is "statement context" not the right term here?  Would "definition context" or "toplevel context" be better?
Concrete performance data, from fklockii-iMac (2.66 Ghz Intel Core i5), 32-bit Release avmshell build from rev 6731:4dbf526976e9

% $AVM f.abc 
1shot fcn with global, n2-n1:4769ns
% $AVM g.abc
1shot fcn with  local, n2-n1:15917ns
% $AVM s.abc
named fcn with  local, n2-n1:4047ns
Assuming coerce_a is indeed the culprit, there are two obvious paths from here:

  (1.) Change asc.jar to avoid emitting coerce_a in this case, if possible, or

  (2.) Change AVM so that coerce_a is faster.

In the absence of other information, I'm guessing (2.) is more likely to bear fruit in the short term.

Comment 5

7 years ago
The appearance of coerce_a is almost always bad news.  It means a boxing conversion (doubleToAtom, need not always allocate a box and won't do it in this case but it is not cheap in any case) and also tag checking and possibly slow paths when we come around the loop again.

IMO this is clearly an ASC bug, and IMO also a fairly bad one - ASC must not be allowed to throw away type information like it appears to do here.  We should try to fix it.  We will be shipping a new ASC to support float/float4, and we should consider fixing the bug in that version.

Comment 6

7 years ago
Judging from a larger abcdump context, it appears ASC does not perform type analysis on function expression bodies but just assumes everything is "Atom".  Ouch.


7 years ago
Blocks: 639475
(In reply to Lars T Hansen from comment #6)
> Judging from a larger abcdump context, it appears ASC does not perform type
> analysis on function expression bodies but just assumes everything is
> "Atom".

Yes, I think this explains all of the situations where I had thought "small" and "orthogonal" changes to the input were hiding the regression.  (The changes were, in hindsight, neither small nor orthogonal, at least at the parse-tree level; they all involved, in part, lifting the function-expression up to a function-definition, which I had assumed, ha ha, was a safe refactoring.)
You need to log in before you can comment on or make changes to this bug.