Bug 1843932 (Open), opened 1 year ago, updated 1 month ago

Too much time spent in the allocator during innerHTML parsing when creating DOM nodes

Categories: Core :: Memory Allocator, defect, P2

People: Reporter: mstange, Unassigned

References: Depends on 5 open bugs, Blocks 1 open bug

Whiteboard: [sp3]

11% of our innerHTML time is spent allocating DOM nodes:
5500 samples in nsNodeInfoManager::Allocate: https://share.firefox.dev/3PZHDf7
of 48000 samples in the innerHTML setter: https://share.firefox.dev/3K0dwQY

Reducing this allocation time would have a big impact on our score on the innerHTML-bound Speedometer 3 subtests, e.g. TodoMVC-jQuery. You can run the test here: https://shell--speedometer-preview.netlify.app/?suite=TodoMVC-jQuery&iterationCount=33

50% of the DOM arena allocations are for text nodes, 25% for input elements: https://share.firefox.dev/44qhVop

Can we increase some sizes here to "batch" some overhead? E.g. I'm not sure what a bin run is, but could we reduce the number of calls to AllocRun by making runs bigger? Or can we reduce the number of calls to RedBlackTree::First?

Furthermore, mutex locking and unlocking takes a fair amount of time. Why do we need to lock here? Aren't the allocations in the DOM arena main-thread-only?

Alternatively, could we add an API to create multiple allocations with a single locking? For example, we could allocate TextNodes in batches of 128, similar to what we do for MiscContainer (though MiscContainer is definitely hitting the lock 128 times here; the goal for TextNode would be to only hit it once per batch).
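The batching idea could be sketched roughly like the following. Everything here is hypothetical (BatchedNodeCache and its refill strategy are illustrations, not Firefox code): a real version would need a mozjemalloc API that carves out all 128 regions under a single lock acquisition, which the plain malloc loop below merely stands in for.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical sketch: a cache that refills 128 slots at a time, so the
// allocator's lock would be taken once per refill instead of once per
// node. RefillFromAllocator stands in for a (hypothetical) batch-alloc
// API; here it just calls malloc in a loop.
class BatchedNodeCache {
 public:
  static constexpr size_t kBatch = 128;

  void* Allocate(size_t aSize) {
    if (mCount == 0) {
      // In the real design this would be the single "locked" call
      // per kBatch allocations.
      RefillFromAllocator(aSize);
    }
    return mSlots[--mCount];
  }

 private:
  void RefillFromAllocator(size_t aSize) {
    // A real implementation would take the arena lock once and carve
    // kBatch regions out of one or more bin runs.
    for (size_t i = 0; i < kBatch; i++) {
      mSlots[i] = malloc(aSize);
    }
    mCount = kBatch;
  }

  std::array<void*, kBatch> mSlots{};
  size_t mCount = 0;
};
```

The cache trades memory (up to 127 parked regions per size) for far fewer trips through the locked allocator path, which is the same trade MiscContainer's batching makes today, minus the per-element lock hits.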

To compare against Chrome, we can compare the number of samples in nsNodeInfoManager::Allocate under nsHtml5TreeBuilder::flushCharacters with the number of samples in MakeGarbageCollectedTraitBase<blink::Text>::Allocate:
Firefox: https://share.firefox.dev/43BihHW 2699 samples
Chrome: https://share.firefox.dev/43xLtPQ 492 samples (5x faster)
(or 1198 samples if you include member initialization: https://share.firefox.dev/43tQ0Tv, which is still over 2x faster)

(In reply to Markus Stange [:mstange] from comment #0)

> Furthermore, mutex locking and unlocking takes a fair amount of time. Why do we need to lock here? Aren't the allocations in the DOM arena main-thread-only?

It looks like the lock that we're spending time on is in ArenaCollection::GetById. Could we not hand out arena_t's instead of arena_id_t's and avoid the locking here?

See Also: → 1824655

(In reply to Jeff Muizelaar [:jrmuizel] from comment #1)

> (In reply to Markus Stange [:mstange] from comment #0)
>
> > Furthermore, mutex locking and unlocking takes a fair amount of time. Why do we need to lock here? Aren't the allocations in the DOM arena main-thread-only?
>
> It looks like the lock that we're spending time on is in ArenaCollection::GetById. Could we not hand out arena_t's instead of arena_id_t's and avoid the locking here?

I proposed this but it was rejected since the arena_id abstraction is there for security. Nobody is really sure how much security it provides, though. The discussion is in Bug 1810953. The latest suggestion was to use an RWLock, but I found that it provided no perf benefit. I'd like to re-propose removing that abstraction again, this time with some performance results, and I'd like to use this test as an example to motivate that.
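For reference, the indirection under discussion looks roughly like this. This is a standalone sketch using mozjemalloc-style names (arena_t, arena_id_t, ArenaCollection), not the actual mozjemalloc code: every allocation into a labeled arena pays a locked id-to-pointer lookup, which handing callers an arena_t* directly would avoid, at the cost of the (debated) security value of the opaque id.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

struct arena_t {
  uint64_t mId;
};
using arena_id_t = uint64_t;

class ArenaCollection {
 public:
  arena_id_t CreateArena() {
    std::lock_guard<std::mutex> lock(mLock);
    arena_id_t id = mNextId++;
    mArenas.emplace(id, arena_t{id});
    return id;
  }

  // Today's shape: this locked lookup runs on every allocation into
  // the arena, even when there is no contention.
  arena_t* GetById(arena_id_t aId) {
    std::lock_guard<std::mutex> lock(mLock);
    auto it = mArenas.find(aId);
    return it == mArenas.end() ? nullptr : &it->second;
  }

 private:
  std::mutex mLock;
  std::unordered_map<arena_id_t, arena_t> mArenas;
  arena_id_t mNextId = 1;
};
```

Returning arena_t* from CreateArena and passing it straight to the allocation entry points would delete GetById from the hot path entirely, which is the trade-off Bug 1810953 debates.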

(In reply to Markus Stange [:mstange] from comment #0)

> 11% of our innerHTML time is spent allocating DOM nodes:
> 5500 samples in nsNodeInfoManager::Allocate: https://share.firefox.dev/3PZHDf7
> of 48000 samples in the innerHTML setter: https://share.firefox.dev/3K0dwQY
>
> Reducing this allocation time would have a big impact on our score on the innerHTML-bound Speedometer 3 subtests, e.g. TodoMVC-jQuery. You can run the test here: https://shell--speedometer-preview.netlify.app/?suite=TodoMVC-jQuery&iterationCount=33
>
> 50% of the DOM arena allocations are for text nodes, 25% for input elements: https://share.firefox.dev/44qhVop
>
> Can we increase some sizes here to "batch" some overhead? E.g. I'm not sure what a bin run is, but could we reduce the number of calls to AllocRun by making runs bigger? Or can we reduce the number of calls to RedBlackTree::First?

I'll look into it (Bug 1843998).

> Furthermore, mutex locking and unlocking takes a fair amount of time. Why do we need to lock here? Aren't the allocations in the DOM arena main-thread-only?

> Alternatively, could we add an API to create multiple allocations with a single locking? For example, we could allocate TextNodes in batches of 128, similar to what we do for MiscContainer (though MiscContainer is definitely hitting the lock 128 times here; the goal for TextNode would be to only hit it once per batch).

YES! I have wanted to do this for a little while now. (Bug 1843997)

Depends on: 1843997, 1810953, 1843998

I've tried various versions of batch allocations for nodes and so far haven't found a way to make it improve performance. But yes, if mozjemalloc provided some new API to avoid locking, that might work better.

(In reply to Paul Bone [:pbone] from comment #2)

> I proposed this but it was rejected since the arena_id abstraction is there for security. Nobody is really sure how much security it provides, though. The discussion is in Bug 1810953. The latest suggestion was to use an RWLock, but I found that it provided no perf benefit. I'd like to re-propose removing that abstraction again with some performance results. So I'd like to use this test as an example to motivate that.

It's worth looking closer, but I believe the performance cost of the lock is the atomic in a cache line that's shared between threads. That is, the lock is not actually contended, but the cache line is still shared. This would explain why using an RWLock doesn't help.
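A minimal standalone illustration of that effect (not Firefox code): even an uncontended atomic is slow if other data on the same cache line is written by other threads, because every such write invalidates the line under the lock word. The usual mitigation is to give the hot word its own 64-byte line.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Two layouts for a lock word plus unrelated data. In Shared, both
// fields sit on one 64-byte cache line, so a write to `other` from one
// core invalidates the line holding `lockWord` on every other core,
// even though no thread ever waits on the lock itself (false sharing).
struct Shared {
  std::atomic<uint64_t> lockWord{0};
  uint64_t other = 0;  // same cache line as lockWord
};

// Padding each field to its own cache line removes the false sharing,
// so the lock word's line stays resident in the owning core's cache.
struct Padded {
  alignas(64) std::atomic<uint64_t> lockWord{0};
  alignas(64) uint64_t other = 0;  // separate line: no ping-ponging
};
```

The 64-byte figure is an assumption (the common x86-64 line size); C++17 offers std::hardware_destructive_interference_size for a portable value where implemented.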

I took a closer look:

Here's the assembly for RtlEnterCriticalSection from https://share.firefox.dev/3K3tMkj:

sub rsp, 0x28
mov rax, qword gs:[0x30]
lock btr qword [rcx + 0x8], 0x0
mov rax, qword [rax + 0x48]
jnb $+0x13
mov qword [rcx + 0x10], rax
xor eax, eax
mov dword [rcx + 0xc], 0x1
add rsp, 0x28
ret
int 0x3
cmp qword [rcx + 0x10], rax
jnz $+0xb
inc dword [rcx + 0xc]
xor eax, eax
add rsp, 0x28
ret
int 0x3
call 0xe
jmp $-0x1e

We get 105 samples on the instruction after lock btr qword [rcx + 0x8], 0x0. I'd say this confirms my theory that the cost here is not from actual lock contention but from cache-line sharing between threads.

Severity: -- → S3
Priority: -- → P2
Component: DOM: Core & HTML → Memory Allocator
Depends on: 1844359

Bug 1844359 improved the time in nsNodeInfoManager::Allocate by 20%.

New profile: https://share.firefox.dev/47ExXx3 (4340 samples)

Depends on: 1857133
Depends on: 1928177