Too much time spent in the allocator during innerHTML parsing when creating DOM nodes
Categories: Core :: Memory Allocator, defect, P2
People: Reporter: mstange, Unassigned
References: Depends on 5 open bugs, Blocks 1 open bug
Details: Whiteboard: [sp3]
11% of our innerHTML time is spent allocating DOM nodes:
5500 samples in nsNodeInfoManager::Allocate: https://share.firefox.dev/3PZHDf7
out of 48000 samples in the innerHTML setter: https://share.firefox.dev/3K0dwQY
Reducing this allocation time would have a big impact on our score on the innerHTML-bound Speedometer 3 subtests, e.g. TodoMVC-jQuery. You can run the test here: https://shell--speedometer-preview.netlify.app/?suite=TodoMVC-jQuery&iterationCount=33
50% of the DOM arena allocations are for text nodes, 25% for input elements: https://share.firefox.dev/44qhVop
Can we increase some sizes here to "batch" some overhead? E.g. I'm not sure what a bin run is, but could we reduce the number of calls to AllocRun
by making runs bigger? Or can we reduce the number of calls to RedBlackTree::First?
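(Back-of-the-envelope illustration, not how mozjemalloc actually sizes runs: assuming a bin run is a contiguous block carved into fixed-size regions, one AllocRun call is amortized over all the regions in that run, so larger runs mean proportionally fewer AllocRun calls for the same number of allocations. The names and numbers below are made up for illustration.)

// Hedged sketch, not mozjemalloc's real layout: if a run of `runSize` bytes
// is carved into fixed-size regions of `regionSize` bytes (ignoring any run
// header), then one AllocRun call services `regionsPerRun` allocations
// before a new run is needed.
#include <cstddef>
#include <cstdio>

size_t AllocRunCallsNeeded(size_t allocationCount, size_t runSize,
                           size_t regionSize) {
  size_t regionsPerRun = runSize / regionSize;
  // Round up: a partially used run still costs one AllocRun call.
  return (allocationCount + regionsPerRun - 1) / regionsPerRun;
}

int main() {
  // Illustrative numbers only: 100000 small DOM-node-sized allocations.
  printf("4 KiB runs:  %zu AllocRun calls\n",
         AllocRunCallsNeeded(100000, 4096, 128));
  printf("16 KiB runs: %zu AllocRun calls\n",
         AllocRunCallsNeeded(100000, 16384, 128));
}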
Furthermore, mutex locking and unlocking takes a fair amount of time. Why do we need to lock here? Aren't the allocations in the DOM arena main-thread-only?
Alternatively, could we add an API to create multiple allocations under a single lock acquisition? For example, we could allocate TextNodes in batches of 128, similar to what we do for MiscContainer (though MiscContainer is definitely hitting the lock 128 times here; the goal for TextNode would be to only hit it once per batch).
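(A minimal sketch of what such a batch API could look like. Everything here is hypothetical: moz_arena_malloc_batch and TextNodeAllocCache are made-up names, and the stand-in implementation just loops over malloc so the sketch compiles on its own; a real mozjemalloc version would take the arena lock once for the whole batch.)

// Hypothetical batch-allocation sketch. moz_arena_malloc_batch() does NOT
// exist in mozjemalloc; the intent is that a real implementation would take
// the arena lock once and hand back aCount same-sized regions.
#include <cstdlib>
#include <cstddef>

using arena_id_t = size_t;  // assumed to match the id type in mozmemory.h

size_t moz_arena_malloc_batch(arena_id_t /*aArena*/, size_t aSize,
                              size_t aCount, void** aOut) {
  // Stand-in only: a real version would lock the arena once per batch.
  size_t n = 0;
  for (; n < aCount; ++n) {
    aOut[n] = malloc(aSize);
    if (!aOut[n]) break;
  }
  return n;
}

// Caller-side cache that Text node creation could draw from, so the arena
// lock is touched once per 128 nodes instead of once per node.
class TextNodeAllocCache {
 public:
  explicit TextNodeAllocCache(arena_id_t aArena) : mArena(aArena) {}

  void* Take(size_t aSize) {
    if (mUsed == mFilled) {
      mFilled = moz_arena_malloc_batch(mArena, aSize, kBatchSize, mSlots);
      mUsed = 0;
      if (mFilled == 0) return nullptr;  // allocation failure; caller handles it
    }
    return mSlots[mUsed++];
  }

 private:
  static constexpr size_t kBatchSize = 128;
  arena_id_t mArena;
  void* mSlots[kBatchSize] = {};
  size_t mFilled = 0;
  size_t mUsed = 0;
};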
To compare against Chrome, we can compare the number of samples in nsNodeInfoManager::Allocate under nsHtml5TreeBuilder::flushCharacters to the number of samples in MakeGarbageCollectedTraitBase<blink::Text>::Allocate:
Firefox: https://share.firefox.dev/43BihHW 2699 samples
Chrome: https://share.firefox.dev/43xLtPQ 492 samples (5x faster)
(or 1198 samples if you include member initialization: https://share.firefox.dev/43tQ0Tv, which is still over 2x faster)
Comment 1•1 year ago
(In reply to Markus Stange [:mstange] from comment #0)
> Furthermore, mutex locking and unlocking takes a fair amount of time. Why do we need to lock here? Aren't the allocations in the DOM arena main-thread-only?

It looks like the lock that we're spending time on is in ArenaCollection::GetById. Could we not hand out arena_t's instead of arena_id_t's and avoid the locking here?
Comment 2•1 year ago
(In reply to Jeff Muizelaar [:jrmuizel] from comment #1)
> (In reply to Markus Stange [:mstange] from comment #0)
> > Furthermore, mutex locking and unlocking takes a fair amount of time. Why do we need to lock here? Aren't the allocations in the DOM arena main-thread-only?
>
> It looks like the lock that we're spending time on is in ArenaCollection::GetById. Could we not hand out arena_t's instead of arena_id_t's and avoid the locking here?

I proposed this but it was rejected since the arena_id abstraction is there for security. Nobody is really sure how much security it provides, though. The discussion is on Bug 1810953. The latest suggestion was to use an RWLock, but I found that it provided no perf benefit. I'd like to re-propose removing that abstraction with some performance results, and I'd like to use this test as an example to motivate that.
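(For illustration, a hedged sketch of the two API shapes being discussed. moz_create_arena/moz_arena_malloc are the existing mozjemalloc entry points; the pointer-returning variant below is hypothetical, with made-up names, and only shows how skipping the id-to-arena lookup would take the GetById lock off the hot path.)

// Sketch of the API change under discussion. The first shape is roughly what
// exists today (see memory/build/mozmemory.h); the second is hypothetical.

// --- Today (simplified): every allocation resolves the id under a lock. ---
// arena_id_t moz_create_arena();
// void* moz_arena_malloc(arena_id_t aArena, size_t aSize);
//   -> internally: ArenaCollection::GetById(aArena)   // takes the lock
//                  arena->Malloc(aSize)

// --- Hypothetical alternative: callers hold an opaque arena handle. ---
#include <cstddef>

struct moz_arena_handle;  // opaque; never dereferenced by callers

// These functions do not exist; they are stand-in names for "return the
// arena itself instead of an id".
moz_arena_handle* moz_create_arena_handle();
void* moz_arena_handle_malloc(moz_arena_handle* aArena, size_t aSize);
//   -> internally: no GetById lookup and no ArenaCollection lock on this
//      path; the tradeoff is losing the id indirection that Bug 1810953
//      discusses as a (debated) security/hardening property.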
Comment 3•1 year ago
(In reply to Markus Stange [:mstange] from comment #0)
> 11% of our innerHTML time is spent allocating DOM nodes:
> 5500 samples in nsNodeInfoManager::Allocate: https://share.firefox.dev/3PZHDf7
> out of 48000 samples in the innerHTML setter: https://share.firefox.dev/3K0dwQY
> Reducing this allocation time would have a big impact on our score on the innerHTML-bound Speedometer 3 subtests, e.g. TodoMVC-jQuery. You can run the test here: https://shell--speedometer-preview.netlify.app/?suite=TodoMVC-jQuery&iterationCount=33
> 50% of the DOM arena allocations are for text nodes, 25% for input elements: https://share.firefox.dev/44qhVop
> Can we increase some sizes here to "batch" some overhead? E.g. I'm not sure what a bin run is, but could we reduce the number of calls to AllocRun by making runs bigger? Or can we reduce the number of calls to RedBlackTree::First?

I'll look into it (Bug 1843998).

> Furthermore, mutex locking and unlocking takes a fair amount of time. Why do we need to lock here? Aren't the allocations in the DOM arena main-thread-only?
> Alternatively, could we add an API to create multiple allocations under a single lock acquisition? For example, we could allocate TextNodes in batches of 128, similar to what we do for MiscContainer (though MiscContainer is definitely hitting the lock 128 times here; the goal for TextNode would be to only hit it once per batch).

YES! I have wanted to do this for a little while now. (Bug 1843997)
Comment 4•1 year ago
I've tried various versions of batch allocations for nodes and so far haven't found a way to make it improve performance. But yes, if mozjemalloc provided some new API to avoid locking, that might work better.
Comment 5•1 year ago
(In reply to Paul Bone [:pbone] from comment #2)
> I proposed this but it was rejected since the arena_id abstraction is there for security. Nobody is really sure how much security it provides, though. The discussion is on Bug 1810953. The latest suggestion was to use an RWLock, but I found that it provided no perf benefit. I'd like to re-propose removing that abstraction with some performance results, and I'd like to use this test as an example to motivate that.

It's worth looking closer, but I believe the performance cost of the lock is the atomic operation on a cache line that's shared between threads, i.e. the lock is not actually contended, but the cache line is still shared. This would explain why using an RWLock doesn't help.
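(A hedged illustration of the cache-line point, not the actual mozjemalloc layout: if the mutex word sits in a cache line that other threads also read or write, every lock/unlock drags that line across cores even when the lock is never contended; padding the lock onto its own line is the usual mitigation. The field names below are made up.)

// Sketch only; does not reflect ArenaCollection's real fields.
#include <mutex>
#include <cstddef>

constexpr size_t kCacheLine = 64;  // typical x86-64 cache line size

// Problematic shape: the mutex shares a cache line with data that other
// threads touch, so even an uncontended lock/unlock bounces the line.
struct CollectionShared {
  std::mutex mLock;
  size_t mArenaCount;   // read/written by other threads
  void* mArenas[8];     // likewise
};

// Mitigation shape: give the lock its own cache line so an uncontended
// acquire/release stays core-local.
struct CollectionPadded {
  alignas(kCacheLine) std::mutex mLock;
  alignas(kCacheLine) size_t mArenaCount;
  void* mArenas[8];
};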
Comment 6•1 year ago
I took a closer look:
Here's the assembly for RtlEnterCriticalSection from https://share.firefox.dev/3K3tMkj:
; annotations assume the standard x64 TEB and RTL_CRITICAL_SECTION layouts
sub rsp, 0x28
mov rax, qword gs:[0x30]          ; rax = TEB
lock btr qword [rcx + 0x8], 0x0   ; atomically test-and-clear the "unlocked" bit of LockCount
mov rax, qword [rax + 0x48]       ; rax = current thread id (TEB ClientId.UniqueThread)
jnb $+0x13                        ; bit was already clear -> lock not free, take the slow path
mov qword [rcx + 0x10], rax       ; fast path: OwningThread = current thread
xor eax, eax
mov dword [rcx + 0xc], 0x1        ; RecursionCount = 1
add rsp, 0x28
ret
int 0x3
cmp qword [rcx + 0x10], rax       ; slow path: do we already own the lock?
jnz $+0xb                         ; no -> contended path below
inc dword [rcx + 0xc]             ; recursive acquire: RecursionCount++
xor eax, eax
add rsp, 0x28
ret
int 0x3
call 0xe                          ; contended path: block until the lock is available
jmp $-0x1e
We get 105 samples on the instruction after lock btr qword [rcx + 0x8], 0x0. I'd say this confirms my theory that the cost here is not from actual lock contention but instead from cache line sharing between threads.
Comment 7•1 year ago
Bug 1844359 improved the time in nsNodeInfoManager::Allocate by 20%.
New profile: https://share.firefox.dev/47ExXx3 (4340 samples)
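(Sanity check, assuming the 5500-sample figure from comment 0 is the right baseline for this comparison: (5500 - 4340) / 5500 ≈ 21%, which matches the roughly 20% improvement.)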