Open Bug 1922073 Opened 4 months ago Updated 1 month ago

js::jit::GenerateCode spends a lot of time in malloc lock contention

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

ARM64
Android
defect

Tracking

ASSIGNED
Performance Impact medium

People

(Reporter: denispal, Assigned: nbp)

References

(Depends on 1 open bug, Blocks 3 open bugs)

Details

(Whiteboard: [sp3])

You can see a lot of time spent trying to grow Vectors during code generation which leads to a lot of lock contention. This was measured during the NewsSite-Next test of Speedometer3.

https://share.firefox.dev/3NakO5Y

The code generator has vectors for OsiPoint and safepoint indexes. I think we should be able to pre-allocate these vectors to the right size if we track how many of these there are before codegen.

Also maybe giving the vectors some inline capacity would help. We should collect some data to see how large they get for typical functions.

Flags: needinfo?(jdemooij)

Looking at the profile, I see that:

# samples   Time    Under frame
      308   104ms   js::jit::CompactBufferWriter::writeByte
      226    72ms   js_pod_arena_realloc<js::jit::OutOfLineCode*>
      153    43ms   mozilla::Vector<js::jit::OsiIndex, …>::growStorageBy
      148    42ms   mozilla::Vector<js::jit::CodegenSafepointIndex, …>::growStorageBy

For the OutOfLineCode*, I do not see any reason to use a vector; these entries can probably be allocated in the LifoAlloc.

For the OsiIndex and CodegenSafepointIndex, we can count the number of OsiPoints inserted and make a single allocation.

For the CompactBufferWriter, which backs the SafepointWriter and SnapshotWriter, we can probably pre-allocate using heuristics based on the number of OsiPoints and the number of ResumePoint operands.
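The OutOfLineCode suggestion above amounts to replacing `Vector<OutOfLineCode*>` with an intrusive list whose nodes live in the arena, so appending never reallocates at all. A minimal sketch, with a tiny bump allocator standing in for SpiderMonkey's LifoAlloc and a stripped-down `OutOfLineCode` (names and fields are illustrative, not the real jit classes):

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Tiny bump allocator standing in for LifoAlloc: allocations are carved
// out of one chunk and freed all at once. No overflow handling in this
// sketch; the real LifoAlloc chains chunks.
class Arena {
    alignas(std::max_align_t) char storage_[1 << 16];
    size_t used_ = 0;
public:
    void* alloc(size_t n) {
        n = (n + alignof(std::max_align_t) - 1) &
            ~(alignof(std::max_align_t) - 1);
        if (used_ + n > sizeof(storage_)) return nullptr;
        void* p = storage_ + used_;
        used_ += n;
        return p;
    }
};

// OutOfLineCode with an intrusive `next` link: the list lives entirely
// in the arena, so appending never touches malloc, unlike a growing
// Vector<OutOfLineCode*>.
struct OutOfLineCode {
    int id;
    OutOfLineCode* next = nullptr;
};

struct OutOfLineList {
    OutOfLineCode* head = nullptr;
    OutOfLineCode** tail = &head;
    void append(OutOfLineCode* ool) { *tail = ool; tail = &ool->next; }
};

OutOfLineCode* newOutOfLineCode(Arena& arena, int id) {
    return new (arena.alloc(sizeof(OutOfLineCode))) OutOfLineCode{id};
}
```

The trade-off is losing random access, but codegen only ever appends and then iterates the out-of-line paths in order, so a list suffices.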

Nicolas, maybe you can look into fixing this.

Flags: needinfo?(jdemooij) → needinfo?(nicolas.b.pierron)
Depends on: 1922259
Blocks: sm-runtime
Severity: -- → S3
Priority: -- → P2
Whiteboard: [js-perf-next]
Depends on: 1922829
Assignee: nobody → nicolas.b.pierron
Status: NEW → ASSIGNED
Flags: needinfo?(nicolas.b.pierron)
Depends on: 1923090

Conventional media will not tell you this:
One of the dependent patches (or all of them combined) has had a _relatively large_ improvement on Linux/Mac compared to Windows:

WebAssembly Embenchen-compile: 2.4% on Linux / 1.2% on macOS

Whiteboard: [js-perf-next] → [js-perf-next][sp3]

With the three dependencies fixed, is it worth rechecking this now?

Flags: needinfo?(dpalmeiro)

I have not done a direct before & after comparison, but I did recently measure Ion compile time as a whole; compared to my previous measurements from August, we are much faster in both RegAlloc and Codegen.

August:
RegAlloc 11x worse than Maglev
Codegen 1.6x worse than Maglev

October:
RegAlloc 8x worse than Maglev
Codegen has parity with Maglev

So, these are really great improvements!

I collected a new profile that contains these fixes (https://share.firefox.dev/3YOVBVo), and thanks to this work I see most of the previous hot spots eliminated. There is still some malloc lock contention, but it is all coming from js::jit::CompactBufferWriter::writeByte, which will presumably be much harder to eliminate. Perhaps we can pre-allocate this vector based on the bytecode size of the function?

Flags: needinfo?(dpalmeiro)

Glancing at the profile, it looks like the main uses of CompactBufferWriter are in these three per-CodeGenerator streams, so pre-allocation doesn't seem crazy.

Denis: Do we have any estimate of how much improvement we saw from reducing regalloc contention / how much we were allocating there vs here? In other words: how valuable do we think this would be?

(In reply to Iain Ireland [:iain] from comment #8)

Denis: Do we have any estimate of how much improvement we saw from reducing regalloc contention / how much we were allocating there vs here? In other words: how valuable do we think this would be?

It's hard to accurately estimate the impact since this work is done off-thread and subject to a lot of varying conditions. However, for SP3 I would estimate it will be very minor. Most of the time there is spent in Baseline, and improving Ion codegen any further would probably only be a minor improvement to overall Ion compile time. We're still spending most of that time in regalloc. This might be more impactful if we decide to lower Ion thresholds to get more functions into Ion sooner, but there are bigger issues at play.

Sounds reasonable. I'm going to bump this down in priority, since we've already extracted most of the benefit.

In the future, I do think that improvements to how we encode snapshots / safepoints / recover instructions are one of the more promising opportunities for reducing off-thread Ion compile time.

Priority: P2 → P3

I collected a bunch of stats from our test suite in order to predict the size of the compact buffers, so far without success.
I am planning to give it more CPU time to see if we can find an upper bound.
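The kind of offline analysis described here could be sketched as follows: given (bytecodeLength, finalBufferSize) pairs sampled from a test-suite run, compute the per-function ratio and take a high percentile as a candidate pre-allocation factor. All numbers and names are illustrative, not actual SpiderMonkey data:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Given (bytecodeLength, finalBufferSize) samples, return the ratio at
// the requested percentile (pct in [0, 1]). Using a percentile rather
// than the max keeps one pathological function from inflating the
// reservation for everyone.
double percentileRatio(const std::vector<std::pair<size_t, size_t>>& samples,
                       double pct) {
    std::vector<double> ratios;
    ratios.reserve(samples.size());
    for (const auto& [bytecode, buffer] : samples) {
        ratios.push_back(double(buffer) / double(bytecode));
    }
    std::sort(ratios.begin(), ratios.end());
    size_t idx = size_t(pct * (ratios.size() - 1));
    return ratios[idx];
}
```

If no percentile gives a usable linear bound, that would explain the lack of success so far: the buffer size may simply not be a function of bytecode length alone.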

Whiteboard: [js-perf-next][sp3] → [sp3]
Depends on: 1940069