Bug 715737 - IonMonkey: Queueing instructions for LSRA is extremely expensive.
Status: RESOLVED FIXED
Product: Core
Classification: Components
Component: JavaScript Engine
Version: unspecified
Platform: All All
Importance: -- normal
Target Milestone: ---
Assigned To: Sean Stangl [:sstangl]
QA Contact: Jason Orendorff [:jorendorff]
Blocks: IonSpeed 668302
Reported: 2012-01-05 17:00 PST by Sean Stangl [:sstangl]
Modified: 2012-01-10 13:40 PST


Attachments
Patch. (10.96 KB, patch)
2012-01-09 17:24 PST, Sean Stangl [:sstangl]
jdemooij: review+

Description Sean Stangl [:sstangl] 2012-01-05 17:00:58 PST
Dismayed that a patch to implement monomorphic callsites had a negligible effect on crypto-md5 runtime, I ran the 'perf' profiler on an x86_64 opt build, --ion -n.

The profile is as follows:

>    46.56%    js  js                 [.] js::ion::LinearScanAllocator::UnhandledQueue::enqueue(js::ion::
>     9.21%    js  js                 [.] js::Interpret(JSContext*, js::StackFrame*, js::InterpMode)
>     7.00%    js  js                 [.] js::ParseNode::create(js::ParseNodeKind, js::ParseNodeArity, js
>     5.84%    js  [binfmt_misc]      [k] load_misc_binary
>     5.13%    js  [kernel.kallsyms]  [k] avtab_search_node
>     3.86%    js  js                 [.] js::ion::LiveInterval::addRange(js::ion::CodePosition, js::ion:
>     3.38%    js  js                 [.] js::types::TypeSet::addSubsetBarrier(JSContext*, JSScript*, uns
>     3.14%    js  js                 [.] js::ion::MDefinition::replaceAllUsesWith(js::ion::MDefinition*)
>     2.91%    js  [kernel.kallsyms]  [k] __alloc_pages_nodemask
>     2.57%    js  js                 [.] js::ion::BitSet::contains(unsigned int) const
>     2.55%    js  js                 [.] js::ion::LinearScanAllocator::resolveControlFlow()
>     2.55%    js  js                 [.] js::ion::LinearScanAllocator::allocateRegisters()
>     2.55%    js  [kernel.kallsyms]  [k] page_fault
>     2.55%    js  perf-17768.map     [.] 0x7fb921fad677
>     0.11%    js  [kernel.kallsyms]  [k] sched_exec
>     0.09%    js  [kernel.kallsyms]  [k] finish_task_switch
>     0.00%    js  [kernel.kallsyms]  [k] native_write_msr_safe

It appears that nearly half of our time is spent just queueing nodes for LSRA -- not entirely surprising given that we have ~6000 LIR instructions floating around just for the core_md5() function representation.

The profile with greedy looks different:

>    16.66%    js  js                 [.] JS_ResolveStandardClass
>    14.68%    js  js                 [.] js::Interpret(JSContext*, js::StackFrame*, js::InterpMode)
>    13.87%    js  [kernel.kallsyms]  [k] __kmalloc
>    13.10%    js  js                 [.] js::types::TypeMonitorResult(JSContext*, JSScript*, unsigned ch
>     9.03%    js  js                 [.] js::types::TypeSet::hasType(js::types::Type)
>     6.50%    js  [kernel.kallsyms]  [k] clear_page
>     6.23%    js  libc-2.14.90.so    [.] _int_malloc
>     6.17%    js  js                 [.] js::ion::ValueNumberer::findDominatingDef(js::HashMap<unsigned
>     6.13%    js  js                 [.] js::ion::GreedyAllocator::allocateInputs(js::ion::LInstruction*
>     6.10%    js  js                 [.] js::ion::MResumePoint::inherit(js::ion::MBasicBlock*)
>     1.24%    js  perf               [.] 0x19a02
>     0.27%    js  [kernel.kallsyms]  [k] stop_one_cpu
>     0.01%    js  [kernel.kallsyms]  [k] native_write_msr_safe

Perf does not give the absolute time spent in non-JIT code, so I also inserted a timing printf() in TestCompiler() around register allocation:

[sstangl@winhill opt64]$ time ./js --ion -n ~/js/crypto-md5.js 
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:           24
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0

real	0m0.044s
user	0m0.038s
sys	0m0.005s

So indeed, as perf suggested, regalloc in that one (not even particularly long!) function accounts for more than half of total execution time.
Comment 1 Sean Stangl [:sstangl] 2012-01-05 17:10:19 PST
Also note that on this machine, crypto-md5 executes in 12ms with -m -a.
Comment 2 Sean Stangl [:sstangl] 2012-01-05 17:10:50 PST
(In reply to Sean Stangl from comment #1)
> Also note that on this machine, crypto-md5 executes in 12ms with -m -a.
Ahem, I meant -m -n.
Comment 3 David Anderson [:dvander] 2012-01-05 17:20:32 PST
Bleh. Two things here: 

 (1) We should not be inlining everything in crypto-md5. bug 706472 will fix that. (crypto-md5 was notorious for giving the tracer inlining pain for little benefit).

 (2) When we initially enqueue virtual registers, we already get them in sorted order. We shouldn't need a sorted initial enqueueing.
Comment 4 Sean Stangl [:sstangl] 2012-01-05 17:32:04 PST
(In reply to David Anderson [:dvander] from comment #3)
>  (1) We should not be inlining everything in crypto-md5. bug 706472 will fix
> that. (crypto-md5 was notorious for giving the tracer inlining pain for
> little benefit).

Very little is being inlined -- only the safe_add() calls at the end, which are negligible. All of the md5_xx() calls exist as calls, so I don't think inlining is at fault here.

Given that most of the time is spent in enqueue(), I would blame the data structure and algorithm. The UnhandledQueue inflates to a size of 2865 live intervals, drops down to 0, then inflates back up to 1930. Each node is placed by insertion sort over a linked list, which is extremely slow even before matters of its pessimal cache locality.(In 

>  (2) When we initially enqueue virtual registers, we already get them in
> sorted order. We shouldn't need a sorted initial enqueing.

It's an enqueueing of live intervals, whose priorities differ from those of virtual registers. They may map somewhat, but I'm reading them as distinct. Also note that the enqueueing isn't initial -- it actually builds three queues (inactive, active, handled), then mashes them all together via insertion sort at the top of reifyAllocations().
Comment 5 David Anderson [:dvander] 2012-01-05 17:46:13 PST
When we initially enqueue there is one live interval for each virtual register, and I believe this list is already sorted. So we could either just add them in the order they appear, and be done, or do that and then sort them after with qsort() or something.

I'm not sure about reifyAllocations, which requeues everything - I don't know whether that has to be sorted or not but we should do the same thing there (bulk add unsorted, then resort after).
Comment 6 Jan de Mooij [:jandem] 2012-01-06 00:56:52 PST
Good catch.

reifyAllocations should assert unhandled.empty() and inactive.empty() (there's no way these can become active again) and probably add "active" to "handled", since most intervals are already in "handled", and use handled instead of unhandled.

The order matters for MUST_REUSE_INPUT, which assumes the input has been reified. Afaik this doesn't need a sort; it could handle both the input-is-LUse-or-not cases, or traverse the list twice, or something.
Comment 7 Sean Stangl [:sstangl] 2012-01-06 16:50:05 PST
(In reply to Jan de Mooij (:jandem) from comment #6)
> reifyAllocations should assert unhandled.empty() and inactive.empty()
> (there's no way these can become active again) and probably add "active" to
> "handled", since most intervals are already in "handled", and use handled
> instead of unhandled.

The handled list is unsorted. Fixing addSpillIntervals() reduced execution time in crypto-md5 by 10ms; making allocateRegisters() a bit less naive saves an additional 3ms. The rest probably needs smarter sorting.
Comment 8 Jan de Mooij [:jandem] 2012-01-07 01:40:24 PST
(In reply to Sean Stangl from comment #7)
> The handled list is unsorted.

But that's the point - as I said in the second paragraph, reifyAllocations should not require a sorted list. Even if we sorted the intervals very efficiently, it seems wasteful to require a sort here.
Comment 9 Sean Stangl [:sstangl] 2012-01-09 15:18:13 PST
Fixing reifyAllocations() by iterating over virtual registers (no sorting), jandem's idea, dropped execution time by a further 3ms.

Fixing splitInterval() by iterating from the opposite side cut down ~1,000,000 accesses/comparisons in enqueue() to 78,904. Execution time dropped by a further 10ms.

Total execution time on crypto-md5 is now ~22ms, down from ~44ms.
Comment 10 Sean Stangl [:sstangl] 2012-01-09 17:24:48 PST
Created attachment 587211 [details] [diff] [review]
Patch.

Running tests now.
Comment 11 Sean Stangl [:sstangl] 2012-01-09 17:50:55 PST
No additional failures.
Comment 12 Jan de Mooij [:jandem] 2012-01-10 02:16:47 PST
Comment on attachment 587211 [details] [diff] [review]
Patch.

Review of attachment 587211 [details] [diff] [review]:
-----------------------------------------------------------------

Nice work, glad to have this fixed.

::: js/src/ion/LinearScan.cpp
@@ +597,5 @@
> + * Merge virtual register intervals into the UnhandledQueue, taking advantage
> + * of their nearly-sorted ordering.
> + */
> +void
> +LinearScanAllocator::subsumeVirtualRegisterIntervals()

Nit: s/subsume/insert? "enqueue"?

@@ +616,5 @@
> +            }
> +
> +            // Insert forward from the current position, thereby
> +            // sorting by priority within the start class.
> +            unhandled.enqueueForward(*curr, live);

Nice.

@@ +866,5 @@
>  LinearScanAllocator::reifyAllocations()
>  {
> +    // Iterate over each interval, ensuring that definitions are visited before uses.
> +    for (size_t ii = 1; ii < graph.numVirtualRegisters(); ii++) {
> +        for (size_t jj = 0; jj < vregs[ii].numIntervals(); jj++) {

I'd prefer to use i and j here, but |i| may be used by other loops in this function so this is probably the least error-prone way, although it looks a bit unusual.

@@ +873,4 @@
>              continue;
>  
> +        VirtualRegister *reg = interval->reg();
> +        JS_ASSERT(reg);

Nit: add VirtualRegister *reg = vregs[ii]; before the inner loop and use it instead of vregs[ii]. And add a JS_ASSERT(interval->reg() == reg) here.
Comment 13 Sean Stangl [:sstangl] 2012-01-10 13:40:34 PST
hg.mozilla.org/projects/ionmonkey/rev/6caa08de3522
