Closed Bug 715737 Opened 13 years ago Closed 13 years ago

IonMonkey: Queueing instructions for LSRA is extremely expensive.

Category: Core :: JavaScript Engine (defect, severity normal)
Status: RESOLVED FIXED
People: Reporter: sstangl; Assigned: sstangl
References: Blocks 1 open bug
Attachments: 1 file

Dismayed that a patch to implement monomorphic callsites had a negligible effect on crypto-md5 runtime, I ran the 'perf' profiler on an x86_64 opt build with --ion -n.

The profile is as follows:

>    46.56%    js  js                 [.] js::ion::LinearScanAllocator::UnhandledQueue::enqueue(js::ion::
>     9.21%    js  js                 [.] js::Interpret(JSContext*, js::StackFrame*, js::InterpMode)
>     7.00%    js  js                 [.] js::ParseNode::create(js::ParseNodeKind, js::ParseNodeArity, js
>     5.84%    js  [binfmt_misc]      [k] load_misc_binary
>     5.13%    js  [kernel.kallsyms]  [k] avtab_search_node
>     3.86%    js  js                 [.] js::ion::LiveInterval::addRange(js::ion::CodePosition, js::ion:
>     3.38%    js  js                 [.] js::types::TypeSet::addSubsetBarrier(JSContext*, JSScript*, uns
>     3.14%    js  js                 [.] js::ion::MDefinition::replaceAllUsesWith(js::ion::MDefinition*)
>     2.91%    js  [kernel.kallsyms]  [k] __alloc_pages_nodemask
>     2.57%    js  js                 [.] js::ion::BitSet::contains(unsigned int) const
>     2.55%    js  js                 [.] js::ion::LinearScanAllocator::resolveControlFlow()
>     2.55%    js  js                 [.] js::ion::LinearScanAllocator::allocateRegisters()
>     2.55%    js  [kernel.kallsyms]  [k] page_fault
>     2.55%    js  perf-17768.map     [.] 0x7fb921fad677
>     0.11%    js  [kernel.kallsyms]  [k] sched_exec
>     0.09%    js  [kernel.kallsyms]  [k] finish_task_switch
>     0.00%    js  [kernel.kallsyms]  [k] native_write_msr_safe

It appears that nearly half of our time is spent just queueing nodes for LSRA -- perhaps unsurprising, given that we have ~6000 LIR instructions floating around just for the representation of the core_md5() function.

The profile with greedy looks different:

>    16.66%    js  js                 [.] JS_ResolveStandardClass
>    14.68%    js  js                 [.] js::Interpret(JSContext*, js::StackFrame*, js::InterpMode)
>    13.87%    js  [kernel.kallsyms]  [k] __kmalloc
>    13.10%    js  js                 [.] js::types::TypeMonitorResult(JSContext*, JSScript*, unsigned ch
>     9.03%    js  js                 [.] js::types::TypeSet::hasType(js::types::Type)
>     6.50%    js  [kernel.kallsyms]  [k] clear_page
>     6.23%    js  libc-2.14.90.so    [.] _int_malloc
>     6.17%    js  js                 [.] js::ion::ValueNumberer::findDominatingDef(js::HashMap<unsigned
>     6.13%    js  js                 [.] js::ion::GreedyAllocator::allocateInputs(js::ion::LInstruction*
>     6.10%    js  js                 [.] js::ion::MResumePoint::inherit(js::ion::MBasicBlock*)
>     1.24%    js  perf               [.] 0x19a02
>     0.27%    js  [kernel.kallsyms]  [k] stop_one_cpu
>     0.01%    js  [kernel.kallsyms]  [k] native_write_msr_safe

Perf does not give the absolute time spent in non-JIT code, so I also inserted a timing printf() in TestCompiler() around register allocation:

[sstangl@winhill opt64]$ time ./js --ion -n ~/js/crypto-md5.js 
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:           24
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0
Regalloc msec:            0

real	0m0.044s
user	0m0.038s
sys	0m0.005s

So indeed, as perf suggested, regalloc in that one (not even very long!) function occupies more than half of total execution time.
Also note that on this machine, crypto-md5 executes in 12ms with -m -a.
(In reply to Sean Stangl from comment #1)
> Also note that on this machine, crypto-md5 executes in 12ms with -m -a.
Ahem, I meant -m -n.
Bleh. Two things here: 

 (1) We should not be inlining everything in crypto-md5. bug 706472 will fix that. (crypto-md5 was notorious for giving the tracer inlining pain for little benefit).

 (2) When we initially enqueue virtual registers, we already get them in sorted order. We shouldn't need a sorted initial enqueueing.
(In reply to David Anderson [:dvander] from comment #3)
>  (1) We should not be inlining everything in crypto-md5. bug 706472 will fix
> that. (crypto-md5 was notorious for giving the tracer inlining pain for
> little benefit).

Very little is being inlined -- only the safe_add() calls at the end, which are negligible. All of the md5_xx() calls exist as calls, so I don't think inlining is at fault here.

Given that most of the time is spent in enqueue(), I would blame the data structure and algorithm. The UnhandledQueue inflates to a size of 2865 live intervals, drops down to 0, then inflates back up to 1930. Each node is placed by insertion sort over a linked list, which is extremely slow even before considering its pessimal cache locality.
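For reference, a minimal standalone sketch (hypothetical names, not the actual UnhandledQueue code) of what sorted insertion into a singly linked list costs:

struct Node {
    int priority;   // stands in for the interval's sort key
    Node *next;
};

// Walks from the head until the insertion point is found. Inserting n
// nodes this way costs O(n^2) comparisons in the worst case, and every
// step chases a pointer into a potentially cold cache line.
static void
enqueue(Node **head, Node *node)
{
    Node **iter = head;
    while (*iter && (*iter)->priority >= node->priority)
        iter = &(*iter)->next;
    node->next = *iter;
    *iter = node;
}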

>  (2) When we initially enqueue virtual registers, we already get them in
> sorted order. We shouldn't need a sorted initial enqueing.

It's an enqueueing of live intervals, which carry priorities distinct from those of virtual registers. They may map somewhat, but I'm reading them as distinct. Also note that the enqueueing isn't only initial -- the allocator actually builds three queues (inactive, active, handled), then mashes them all together via insertion sort at the top of reifyAllocations().
When we initially enqueue, there is one live interval for each virtual register, and I believe this list is already sorted. So we could either just add them in the order they appear and be done, or do that and then sort them afterward with qsort() or something.

I'm not sure about reifyAllocations, which requeues everything -- I don't know whether that has to be sorted or not, but we should do the same thing there (bulk add unsorted, then sort afterward).
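A minimal sketch of the bulk-add-then-sort approach (illustrative types, not the real IonMonkey ones; qsort() as suggested above):

#include <stdlib.h>

#include <vector>

struct Interval {
    unsigned start;
};

static int
CompareByStart(const void *a, const void *b)
{
    const Interval *ia = *static_cast<Interval *const *>(a);
    const Interval *ib = *static_cast<Interval *const *>(b);
    return (ia->start > ib->start) - (ia->start < ib->start);
}

// Append in whatever order the intervals arrive, then sort once:
// O(n log n) overall instead of O(n^2) for repeated sorted insertion.
static void
BulkEnqueue(std::vector<Interval *> &queue, Interval **items, size_t n)
{
    for (size_t i = 0; i < n; i++)
        queue.push_back(items[i]);
    if (!queue.empty())
        qsort(&queue[0], queue.size(), sizeof(Interval *), CompareByStart);
}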
Assignee: general → sstangl
Good catch.

reifyAllocations should assert unhandled.empty() and inactive.empty() (there's no way these can become active again) and probably add "active" to "handled", since most intervals are already in "handled", and use handled instead of unhandled.

The order matters for MUST_REUSE_INPUT: it assumes the input has been reified. AFAIK this doesn't need a sort; it could handle both the input-is-LUse-or-not cases, or traverse the list twice, or something.
Blocks: IonSpeed
(In reply to Jan de Mooij (:jandem) from comment #6)
> reifyAllocations should assert unhandled.empty() and inactive.empty()
> (there's no way these can become active again) and probably add "active" to
> "handled", since most intervals are already in "handled", and use handled
> instead of unhandled.

The handled list is unsorted. Fixing addSpillIntervals() reduced execution time in crypto-md5 by 10ms; making allocateRegisters() a bit less naive saves an additional 3ms. The rest probably needs smarter sorting.
(In reply to Sean Stangl from comment #7)
> The handled list is unsorted.

But that's the point - as I said in the second paragraph, reifyAllocations should not require a sorted list. Even if we sorted the intervals very efficiently, it seems wasteful to use it here.
Fixing reifyAllocations() by iterating over virtual registers (no sorting), jandem's idea, dropped execution time by a further 3ms.
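A rough sketch of that shape (illustrative stand-in types, not the real API; the actual loop is visible in the patch below):

#include <stddef.h>

#include <vector>

struct LiveInterval {
    bool hasAllocation;
};

struct VirtualRegister {
    // intervals[0] covers the definition; later entries are splits.
    std::vector<LiveInterval *> intervals;
};

// Walk virtual registers in numbering order, which follows instruction
// order, so each definition is reified before any use that must reuse
// it -- no sorted queue needed.
static void
ReifyAllocations(const std::vector<VirtualRegister> &vregs)
{
    for (size_t i = 1; i < vregs.size(); i++) {
        for (size_t j = 0; j < vregs[i].intervals.size(); j++) {
            LiveInterval *interval = vregs[i].intervals[j];
            if (!interval->hasAllocation)
                continue;
            // ... write the interval's allocation into its defs/uses ...
        }
    }
}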

Fixing splitInterval() by iterating from the opposite side cut down ~1,000,000 accesses/comparisons in enqueue() to 78,904. Execution time dropped by a further 10ms.
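Roughly, the opposite-side iteration looks like this (a sketch with sentinel nodes; the real queue's ordering details may differ):

// Doubly linked node; the queue is kept sorted by ascending start.
struct Node {
    int start;
    Node *prev;
    Node *next;
};

// |head| and |tail| are sentinels; real elements sit between them.
// Split intervals start near the current allocation position, which
// sorts near the end of the queue, so scanning backward from the tail
// usually finds the insertion point in a few steps where a forward
// scan from the head would walk nearly the whole list.
static void
EnqueueBackward(Node *head, Node *tail, Node *node)
{
    Node *at = tail->prev;
    while (at != head && at->start > node->start)
        at = at->prev;

    // Insert node immediately after |at|.
    node->prev = at;
    node->next = at->next;
    at->next->prev = node;
    at->next = node;
}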

Total execution time on crypto-md5 is now ~22ms, down from ~44ms.
Blocks: 668302
Attached patch Patch. (Splinter Review)
Running tests now.
Attachment #587211 - Flags: review?(jdemooij)
No additional failures.
Comment on attachment 587211 [details] [diff] [review]
Patch.

Review of attachment 587211 [details] [diff] [review]:
-----------------------------------------------------------------

Nice work, glad to have this fixed.

::: js/src/ion/LinearScan.cpp
@@ +597,5 @@
> + * Merge virtual register intervals into the UnhandledQueue, taking advantage
> + * of their nearly-sorted ordering.
> + */
> +void
> +LinearScanAllocator::subsumeVirtualRegisterIntervals()

Nit: s/subsume/insert? "enqueue"?

@@ +616,5 @@
> +            }
> +
> +            // Insert forward from the current position, thereby
> +            // sorting by priority within the start class.
> +            unhandled.enqueueForward(*curr, live);

Nice.

@@ +866,5 @@
>  LinearScanAllocator::reifyAllocations()
>  {
> +    // Iterate over each interval, ensuring that definitions are visited before uses.
> +    for (size_t ii = 1; ii < graph.numVirtualRegisters(); ii++) {
> +        for (size_t jj = 0; jj < vregs[ii].numIntervals(); jj++) {

I'd prefer to use i and j here, but |i| may be used by other loops in this function so this is probably the least error-prone way, although it looks a bit unusual.

@@ +873,4 @@
>              continue;
>  
> +        VirtualRegister *reg = interval->reg();
> +        JS_ASSERT(reg);

Nit: add VirtualRegister *reg = vregs[ii]; before the inner loop and use it instead of vregs[ii]. And add a JS_ASSERT(interval->reg() == reg) here.
Attachment #587211 - Flags: review?(jdemooij) → review+
hg.mozilla.org/projects/ionmonkey/rev/6caa08de3522
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED