IonMonkey: discardPhhiAt/EliminatePhis has horrible behavior

RESOLVED FIXED in mozilla21

Status

()

Core
JavaScript Engine
RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: mjrosenb, Unassigned)

Tracking

unspecified
mozilla21
All
Linux
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [ion:t])

Attachments

(2 attachments, 2 obsolete attachments)

(Reporter)

Description

6 years ago
Created attachment 618789 [details]
xz compressed js bench that exhibits horrible compilation times

on the emscripten-compiled code for sqlite, we end up spending several hours compiling.
the lion's share of time is spent in a single function, replaceOperand.
    87.97%       js  js                 [.] js::ion::MNode::replaceOperand(unsigned long, js::ion::MDefinition*)
     3.37%       js  js                 [.] js::ion::EliminatePhis(js::ion::MIRGraph&)
     1.33%       js  js                 [.] js::ion::LinearScanAllocator::allocateRegisters()
     0.79%       js  js                 [.] _ZN2js7analyze14ScriptAnalysis17mergeBranchTargetEP9JSContextRNS1_12SSAValueInfoEjRKNS_6VectorIjLm0ENS_15TempAllocPolicyEEEj.part.73
I believe that most of the time is from a single caller, as shown in this backtrace:
#0  0x000000000068b1f1 in js::ion::MNode::replaceOperand (this=0x1004b2e88, index=1217, def=0x0)
    at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/ion/MIR.cpp:274
#1  0x00000000006d1cfd in js::ion::MBasicBlock::discardPhiAt (this=0xda77efc8, at=...)
    at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/ion/MIRGraph.cpp:508
#2  0x00000000006941ec in js::ion::EliminatePhis (graph=...) at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/ion/IonAnalysis.cpp:178
#3  0x00000000006924e8 in TestCompiler (graph=..., builder=...) at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/ion/Ion.cpp:659
#4  IonCompile (cx=0xa5dbc0, script=0x7ffff4468b00, fp=0x7ffff6a98870, osrPc=0xa5e310 "\020\060\372\367\377\177")
    at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/ion/Ion.cpp:776
#5  0x0000000000692cb4 in Compile (osrPc=0x41d0f03 "\344C\b\377\375\300\247\302\260\065", fp=<optimized out>, script=0x7ffff4468b00, cx=0xa5dbc0)
    at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/ion/Ion.cpp:861
#6  Compile (osrPc=0x41d0f03 "\344C\b\377\375\300\247\302\260\065", fp=<optimized out>, script=0x7ffff4468b00, cx=0xa5dbc0)
    at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/ion/Ion.cpp:871
#7  js::ion::CanEnterAtBranch (cx=0xa5dbc0, script=0x7ffff4468b00, fp=<optimized out>, pc=0x41d0f03 "\344C\b\377\375\300\247\302\260\065")
    at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/ion/Ion.cpp:889
#8  0x00000000004a1b23 in js::Interpret (cx=0xa5dbc0, entryFrame=0x7ffff6a97030, interpMode=js::JSINTERP_NORMAL)
    at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/jsinterp.cpp:1851
#9  0x00000000004a8526 in js::RunScript (cx=0xa5dbc0, script=0x7ffff43b6ba0, fp=0x7ffff6a97030)
    at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/jsinterp.cpp:486
#10 0x00000000004a95b0 in ExecuteKernel (result=0x0, thisv=..., scopeChain=..., script=0x7ffff43b6ba0, cx=0xa5dbc0, type=<optimized out>, 
    evalInFrame=<optimized out>) at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/jsinterp.cpp:685
#11 js::Execute (cx=0xa5dbc0, script=0x7ffff43b6ba0, scopeChainArg=..., rval=0x0) at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/jsinterp.cpp:727
#12 0x000000000041fcf6 in JS_ExecuteScript (cx=0xa5dbc0, obj=0x7ffff6803060, scriptArg=<optimized out>, rval=0x0)
    at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/jsapi.cpp:5276
#13 0x000000000040dd58 in Process (cx=0xa5dbc0, obj=0x7ffff6803060, filename=<optimized out>, forceTTY=<optimized out>)
    at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/shell/js.cpp:480
#14 0x0000000000410f73 in ProcessArgs (op=0x7fffffffe2c0, obj=0x7ffff6803060, cx=0xa5dbc0)
    at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/shell/js.cpp:4777
#15 Shell (cx=0xa5dbc0, op=0x7fffffffe2c0, envp=<optimized out>) at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/shell/js.cpp:4860
#16 0x0000000000404396 in main (argc=<optimized out>, argv=<optimized out>, envp=0x7fffffffe4a8)
    at /home/mrosenberg/src/ionmonkey/ionmonkey-dev/js/src/shell/js.cpp:5092

With an optimized build, the function at line 141481 takes 205 seconds to compile, and the function at line 58976 take 11 seconds to compile, both of which we do many times.
Heh, this might have been an area where lazy phi placement actually helped.

Maybe we should hold off on Emscripten until we have chunked compilation working. replaceOperand is O(n) and could easily explode with a huge number of dead phis. I suspect the compiler is going to blow up on large scripts no matter how many little things we fix.

Jan, any chance you could test this one?
(Reporter)

Comment 2

6 years ago
I talked with sstangl about this before, and it looks like the issue is the walk through the use chain to find one particular use.  My idea is to have every element in the operands vector also have a pointer to a node in the use chain, so this lookup becomes an O(1) operation.

If we wish to actively remove elements from the use chains, then we'd either need a doubly linked list or some odd logic to deal with it, but both are doable.
In other words, "MOperand" would be the tuple of (MDefinition, MUse). Then MUses link to MOperands (instead of index), and MOperands link to MUses, so searching through huge MUse lists is no longer necessary. Also lets us stop hardcoding indices.
So this is the best idea we currently have. The goals are to have O(1) replaceOperand(), to not depend on hardcoded indexes, and to not box ourselves into a corner that is slow for deletion.

There is one data structure, an MUse. It contains:
- Pointer to producer MDefinition,
- Pointer to consumer MNode,
- Pointer to previous MUse in producer MUse list,
- Pointer to next node MUse in producer MUse list.

This is slightly more memory than currently used; if we drop the previous pointer, then it's roughly equivalent to the current scheme (4 bytes more on 64-bit; equal on 32-bit).

Each MAryInstruction contains a |FixedArityList<MUse*, Arity> operands_| member, so we don't need to constantly allocate new MUse nodes: only one MUse is valid for a particular Operand slot, so replacing an MUse is literally done by replacement. Additionally, it becomes impossible/nonsensical to set an Operand without also setting a Use, which is great, since they are in 1:1 correlation. This has nice properties for invariant checking.

Then replaceOperand() is trivially O(1) by just removing the MUse from the producer's list, updating the producer pointer, and adding the MUse to the new producer's list, and indexes aren't used.
Ahem, that should be FixedArityList<MUse, Arity>. The memory for the MUses is built into the fixed-arity MDefinitions.
See comment #1 - let's wait until chunked compilation before doing anything here. Most likely we don't want to focus on optimizing replaceOperand. Given large scripts, any minor part of the compilation pipeline can explode.
Whiteboard: [ion:t]
As an additional note, there's slight difficulty when dealing with MPhis: if the MUse vector moves via realloc(), it is necessary to remove/reinsert each MUse in the phi from/into its corresponding use chain.
I can't reproduce this anymore. If we see another phi-elimination outlier, we can reopen and investigate again.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → WORKSFORME
Reopening. Have a data structure simplification patch that solves this issue (per Comment 4) and makes it impossible to break operand/use invariants. Also fixes some issues from present misuse of removed data structures.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Created attachment 674480 [details] [diff] [review]
WIP patch

Work in progress per the above comment. Everything is done and largely simplified -- the only part that gets more complicated is MPhi::addInput(), since we have to handle baked-in pointers around calls to possibly-relocating realloc(). The final version of this patch will remove the majority of addInput()'s callsites, since realloc() is almost never necessary.
Note that the reason this issue does not currently reproduce is because we currently have a max script length limit -- but this patch lets us compile more functions more quickly, letting us theoretically increase that limit. I'll benchmark sometime soon.
Created attachment 706066 [details] [diff] [review]
WIP Rebased and updated single patch

Applies against at least 6cbe99a8f3a7. Passes jit-tests.

I haven't profiled the patch, but the new annoying behavior in MPhi::addInput() might be a source of slowdown. It could be mitigated by storing MUse structures in static locations outside of the MPhi inputs Vector, then maintaining an MUse* Vector inside the MPhi.
Attachment #674480 - Attachment is obsolete: true

Comment 13

5 years ago
I just wanted to point out that this patch is a major improvement on the asm.js version of sqlite: it takes compilation time down from 18 seconds (which perf shows mostly in replaceOperand) down to 8.  Nice!

Comment 14

5 years ago
Oh, to wit: v8 spends 2 *minutes* on the same input (almost OOMing my machine).  Non-Odin SM, on the same input, runs entirely in JM and takes 12 seconds.
Created attachment 706624 [details] [diff] [review]
Inline MUse storage.

Updated patch. I decided against using MoveRef to hide logic from MPhi::addInput(): I would rather have the movement costs be explicit.

"Mostly green" on Try, with all oranges known and no assertions: https://tbpl.mozilla.org/?tree=Try&rev=d5db8b0e3302
Attachment #706066 - Attachment is obsolete: true
Attachment #706624 - Flags: review?(luke)

Comment 16

5 years ago
Comment on attachment 706624 [details] [diff] [review]
Inline MUse storage.

Review of attachment 706624 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!

::: js/src/ion/MIR.cpp
@@ +503,5 @@
>      size_t length = inputs_.length();
> +    for (size_t i = index; i < length - 1; i++) {
> +        MUse *next = getUseFor(i + 1);
> +        next->producer()->removeUse(next);
> +        setOperand(i, next->producer());

Could you write "MPhi::getUseFor" and "MPhi::setOperand" in this loop to avoid the virtual dispatch?

@@ +561,5 @@
> +    // that this operation may cause the producers' linked lists to reference
> +    // invalid memory. Therefore, in the event of moving reallocation, each
> +    // MUse must be removed and reinserted from/into its producer's use chain.
> +    uint32_t index = inputs_.length();
> +    bool performingRealloc = (inputs_.length() == inputs_.capacity());

This depends on rather internal Vector implementation details that may easily change one day.  Instead, could you add a Vector member functions (bool canAppendWithoutRealloc() const) so that this detail is officially recorded as part of the public interface.

::: js/src/ion/MIR.h
@@ +170,4 @@
>      void replaceOperand(size_t index, MDefinition *ins);
>  
> +    // Discards an operand.
> +    void discardOperand(size_t index);

Oh it does, does it ;)  Perhaps you could say something like "Resets the operand to an uninitialized state, breaking the link with the previous operand's producer".

@@ +3031,1 @@
>      bool addInput(MDefinition *ins);

Perhaps we could rename it addInputSlow and add to the comment: "Prefer to use initLength and setOperand instead"?
Attachment #706624 - Flags: review?(luke) → review+
https://hg.mozilla.org/mozilla-central/rev/52db3691dcba
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla21
You need to log in before you can comment on or make changes to this bug.