Closed Bug 715737 Opened 13 years ago Closed 13 years ago

IonMonkey: Queueing instructions for LSRA is extremely expensive.

Category: Core :: JavaScript Engine (defect, severity normal)
Status: RESOLVED FIXED
People: Reporter: sstangl; Assigned: sstangl
References: Blocks 1 open bug
Attachments: 1 file

Dismayed that a patch to implement monomorphic callsites had a negligible effect on crypto-md5 runtime, I ran the 'perf' profiler on an x86_64 opt build with --ion -n.

The profile is as follows:

>    46.56%    js  js                 [.] js::ion::LinearScanAllocator::UnhandledQueue::enqueue(js::ion::
>     9.21%    js  js                 [.] js::Interpret(JSContext*, js::StackFrame*, js::InterpMode)
>     7.00%    js  js                 [.] js::ParseNode::create(js::ParseNodeKind, js::ParseNodeArity, js
>     5.84%    js  [binfmt_misc]      [k] load_misc_binary
>     5.13%    js  [kernel.kallsyms]  [k] avtab_search_node
>     3.86%    js  js                 [.] js::ion::LiveInterval::addRange(js::ion::CodePosition, js::ion:
>     3.38%    js  js                 [.] js::types::TypeSet::addSubsetBarrier(JSContext*, JSScript*, uns
>     3.14%    js  js                 [.] js::ion::MDefinition::replaceAllUsesWith(js::ion::MDefinition*)
>     2.91%    js  [kernel.kallsyms]  [k] __alloc_pages_nodemask
>     2.57%    js  js                 [.] js::ion::BitSet::contains(unsigned int) const
>     2.55%    js  js                 [.] js::ion::LinearScanAllocator::resolveControlFlow()
>     2.55%    js  js                 [.] js::ion::LinearScanAllocator::allocateRegisters()
>     2.55%    js  [kernel.kallsyms]  [k] page_fault
>     2.55%    js  perf-17768.map     [.] 0x7fb921fad677
>     0.11%    js  [kernel.kallsyms]  [k] sched_exec
>     0.09%    js  [kernel.kallsyms]  [k] finish_task_switch
>     0.00%    js  [kernel.kallsyms]  [k] native_write_msr_safe

It appears that nearly half of our time is spent just queueing nodes for LSRA -- perhaps unsurprising, given that we have ~6000 LIR instructions floating around just for the representation of the core_md5() function.

The profile with greedy looks different:

>    16.66%    js  js                 [.] JS_ResolveStandardClass
>    14.68%    js  js                 [.] js::Interpret(JSContext*, js::StackFrame*, js::InterpMode)
>    13.87%    js  [kernel.kallsyms]  [k] __kmalloc
>    13.10%    js  js                 [.] js::types::TypeMonitorResult(JSContext*, JSScript*, unsigned ch
>     9.03%    js  js                 [.] js::types::TypeSet::hasType(js::types::Type)
>     6.50%    js  [kernel.kallsyms]  [k] clear_page
>     6.23%    js  libc-2.14.90.so    [.] _int_malloc
>     6.17%    js  js                 [.] js::ion::ValueNumberer::findDominatingDef(js::HashMap<unsigned
>     6.13%    js  js                 [.] js::ion::GreedyAllocator::allocateInputs(js::ion::LInstruction*
>     6.10%    js  js                 [.] js::ion::MResumePoint::inherit(js::ion::MBasicBlock*)
>     1.24%    js  perf               [.] 0x19a02
>     0.27%    js  [kernel.kallsyms]  [k] stop_one_cpu
>     0.01%    js  [kernel.kallsyms]  [k] native_write_msr_safe

Perf does not give the absolute time spent in non-JIT code, so I also inserted a timing printf() in TestCompiler() around register allocation:

[sstangl@winhill opt64]$ time ./js --ion -n ~/js/crypto-md5.js 
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:           24
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0

real	0m0.044s
user	0m0.038s
sys	0m0.005s

So indeed, as perf suggested, regalloc in that one (not even very long!) function occupies more than half of total execution time.
Also note that on this machine, crypto-md5 executes in 12ms with -m -a.
(In reply to Sean Stangl from comment #1)
> Also note that on this machine, crypto-md5 executes in 12ms with -m -a.
Ahem, I meant -m -n.
Bleh. Two things here: 

 (1) We should not be inlining everything in crypto-md5. bug 706472 will fix that. (crypto-md5 was notorious for giving the tracer inlining pain for little benefit).

 (2) When we initially enqueue virtual registers, we already get them in sorted order. We shouldn't need a sorted initial enqueueing.
(In reply to David Anderson [:dvander] from comment #3)
>  (1) We should not be inlining everything in crypto-md5. bug 706472 will fix
> that. (crypto-md5 was notorious for giving the tracer inlining pain for
> little benefit).

Very little is being inlined -- only the safe_add() calls at the end, which are negligible. All of the md5_xx() calls exist as calls, so I don't think inlining is at fault here.

Given that most of the time is spent in enqueue(), I would blame the data structure and algorithm. The UnhandledQueue inflates to a size of 2865 live intervals, drops down to 0, then inflates back up to 1930. Each node is placed by insertion sort over a linked list, which is extremely slow even before considering its pessimal cache locality.
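For reference, a minimal standalone sketch (hypothetical names, not the actual UnhandledQueue code) of what sorted insertion into a singly linked list costs:

struct Node {
    int priority;   // stands in for the interval's sort key
    Node *next;
};

// Walks from the head until the insertion point is found. Inserting n
// nodes this way costs O(n^2) comparisons in the worst case, and every
// step chases a pointer into a potentially cold cache line.
static void
enqueue(Node **head, Node *node)
{
    Node **iter = head;
    while (*iter && (*iter)->priority >= node->priority)
        iter = &(*iter)->next;
    node->next = *iter;
    *iter = node;
}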

>  (2) When we initially enqueue virtual registers, we already get them in
> sorted order. We shouldn't need a sorted initial enqueing.

It's an enqueueing of live intervals, which carry priorities distinct from those of virtual registers. They may map somewhat, but I'm reading them as distinct. Also note that the enqueueing isn't only initial -- the allocator actually builds three queues (inactive, active, handled), then mashes them all together via insertion sort at the top of reifyAllocations().
When we initially enqueue, there is one live interval for each virtual register, and I believe this list is already sorted. So we could either just add them in the order they appear and be done, or do that and then sort them afterward with qsort() or something.

I'm not sure about reifyAllocations, which requeues everything -- I don't know whether that has to be sorted or not, but we should do the same thing there (bulk add unsorted, then sort afterward).
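A minimal sketch of the bulk-add-then-sort approach (illustrative types, not the real IonMonkey ones; qsort() as suggested above):

#include <stdlib.h>

#include <vector>

struct Interval {
    unsigned start;
};

static int
CompareByStart(const void *a, const void *b)
{
    const Interval *ia = *static_cast<Interval *const *>(a);
    const Interval *ib = *static_cast<Interval *const *>(b);
    return (ia->start > ib->start) - (ia->start < ib->start);
}

// Append in whatever order the intervals arrive, then sort once:
// O(n log n) overall instead of O(n^2) for repeated sorted insertion.
static void
BulkEnqueue(std::vector<Interval *> &queue, Interval **items, size_t n)
{
    for (size_t i = 0; i < n; i++)
        queue.push_back(items[i]);
    if (!queue.empty())
        qsort(&queue[0], queue.size(), sizeof(Interval *), CompareByStart);
}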
Assignee: general → sstangl
Good catch.

reifyAllocations should assert unhandled.empty() and inactive.empty() (there's no way these can become active again) and probably add "active" to "handled", since most intervals are already in "handled", and use handled instead of unhandled.

The order matters for MUST_REUSE_INPUT: it assumes the input has been reified. AFAIK this doesn't need a sort; it could handle both the input-is-LUse-or-not cases, or traverse the list twice, or something.
Blocks: IonSpeed
(In reply to Jan de Mooij (:jandem) from comment #6)
> reifyAllocations should assert unhandled.empty() and inactive.empty()
> (there's no way these can become active again) and probably add "active" to
> "handled", since most intervals are already in "handled", and use handled
> instead of unhandled.

The handled list is unsorted. Fixing addSpillIntervals() reduced execution time in crypto-md5 by 10ms; making allocateRegisters() a bit less naive saves an additional 3ms. The rest probably needs smarter sorting.
(In reply to Sean Stangl from comment #7)
> The handled list is unsorted.

But that's the point - as I said in the second paragraph, reifyAllocations should not require a sorted list. Even if we sorted the intervals very efficiently, it seems wasteful to use it here.
Fixing reifyAllocations() by iterating over virtual registers (no sorting), jandem's idea, dropped execution time by a further 3ms.
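A rough sketch of that shape (illustrative stand-in types, not the real API; the actual loop is visible in the patch below):

#include <stddef.h>

#include <vector>

struct LiveInterval {
    bool hasAllocation;
};

struct VirtualRegister {
    // intervals[0] covers the definition; later entries are splits.
    std::vector<LiveInterval *> intervals;
};

// Walk virtual registers in numbering order, which follows instruction
// order, so each definition is reified before any use that must reuse
// it -- no sorted queue needed.
static void
ReifyAllocations(const std::vector<VirtualRegister> &vregs)
{
    for (size_t i = 1; i < vregs.size(); i++) {
        for (size_t j = 0; j < vregs[i].intervals.size(); j++) {
            LiveInterval *interval = vregs[i].intervals[j];
            if (!interval->hasAllocation)
                continue;
            // ... write the interval's allocation into its defs/uses ...
        }
    }
}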

Fixing splitInterval() by iterating from the opposite side cut down ~1,000,000 accesses/comparisons in enqueue() to 78,904. Execution time dropped by a further 10ms.
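Roughly, the opposite-side iteration looks like this (a sketch with sentinel nodes; the real queue's ordering details may differ):

// Doubly linked node; the queue is kept sorted by ascending start.
struct Node {
    int start;
    Node *prev;
    Node *next;
};

// |head| and |tail| are sentinels; real elements sit between them.
// Split intervals start near the current allocation position, which
// sorts near the end of the queue, so scanning backward from the tail
// usually finds the insertion point in a few steps where a forward
// scan from the head would walk nearly the whole list.
static void
EnqueueBackward(Node *head, Node *tail, Node *node)
{
    Node *at = tail->prev;
    while (at != head && at->start > node->start)
        at = at->prev;

    // Insert node immediately after |at|.
    node->prev = at;
    node->next = at->next;
    at->next->prev = node;
    at->next = node;
}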

Total execution time on crypto-md5 is now ~22ms, down from ~44ms.
Blocks: 668302
Attached patch Patch. (Splinter Review)
Running tests now.
Attachment #587211 - Flags: review?(jdemooij)
No additional failures.
Comment on attachment 587211 [details] [diff] [review]
Patch.

Review of attachment 587211 [details] [diff] [review]:
-----------------------------------------------------------------

Nice work, glad to have this fixed.

::: js/src/ion/LinearScan.cpp
@@ +597,5 @@
> + * Merge virtual register intervals into the UnhandledQueue, taking advantage
> + * of their nearly-sorted ordering.
> + */
> +void
> +LinearScanAllocator::subsumeVirtualRegisterIntervals()

Nit: s/subsume/insert? "enqueue"?

@@ +616,5 @@
> +            }
> +
> +            // Insert forward from the current position, thereby
> +            // sorting by priority within the start class.
> +            unhandled.enqueueForward(*curr, live);

Nice.

@@ +866,5 @@
>  LinearScanAllocator::reifyAllocations()
>  {
> +    // Iterate over each interval, ensuring that definitions are visited before uses.
> +    for (size_t ii = 1; ii < graph.numVirtualRegisters(); ii++) {
> +        for (size_t jj = 0; jj < vregs[ii].numIntervals(); jj++) {

I'd prefer to use i and j here, but |i| may be used by other loops in this function so this is probably the least error-prone way, although it looks a bit unusual.

@@ +873,4 @@
>              continue;
>  
> +        VirtualRegister *reg = interval->reg();
> +        JS_ASSERT(reg);

Nit: add VirtualRegister *reg = vregs[ii]; before the inner loop and use it instead of vregs[ii]. And add a JS_ASSERT(interval->reg() == reg) here.
Attachment #587211 - Flags: review?(jdemooij) → review+
hg.mozilla.org/projects/ionmonkey/rev/6caa08de3522
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED