Avoid doing large memsets to poison values (or to 0, etc) under the memory lock
Categories
(Core :: Memory Allocator, enhancement, P1)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox113 | --- | fixed |
People
(Reporter: jesup, Assigned: pbone)
References
(Blocks 3 open bugs)
Details
(Keywords: perf:responsiveness, Whiteboard: [sp3])
Attachments
(1 file, 1 obsolete file)
We've seen profiles where the MainThread is blocked for 10's of ms on the memory lock, because JS Helper's GC-freeing code is poisoning large blocks of memory (see https://faraday.basschouten.com/mozilla/jsmemcpysmall.html)
At least when freeing large blocks of memory, the overhead cost of unlocking around the memset is in the noise. For small blocks it's probably not needed much and might be counter-productive (especially for very small), though if we can move anything out of the lock in general, that's always good.
See for example: https://perfht.ml/2u1IJ2h
(from https://docs.google.com/spreadsheets/d/1v0w5OZy3MQxRVQj-557wqdV7FzehzIz0AaqRHMaoPFE/edit#gid=75257179)
| Reporter | ||
Comment 1•5 years ago
|
||
Updated•5 years ago
|
Comment 2•5 years ago
|
||
Do you have a link to a profile? The one in comment 0 doesn't work.
Now, without a profile, I'll say this: the claim that "the MainThread is blocked for 10's of ms on the memory lock" is dubious. I can believe that's what it could look like in a sampled profile if many large allocations are freed in a succession on the main thread because of the likeliness to get a sample during memset.
But considering a large allocation is at most 1036288 bytes, spending > 10ms on a memset for that amount of memory sounds ridiculous. That would be less than 100MB/s. SHA1 is much more complex than memset and can do > 500MB/s.
I'm not saying we shouldn't take the patch (but I want to carefully think about the side effects), but I'm saying the premise seems bogus, and the profiles should probably be looked at more deeply.
| Reporter | ||
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Comment 5•3 years ago
|
||
I'm going to reduce the priority on this bug until we have better proof of this being a problem. Speaking from my perspective, I remember seeing this issue show up in profiles every now and then but not very frequently.
We can increase the priority again when we've found more profiles that show this problem.
Comment 6•3 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #2)
Do you have a link to a profile? The one in comment 0 doesn't work.
It does work for me, fwiw. Here's the thread that contains the memset: https://share.firefox.dev/3c5uejI
On that thread you can see that the memset is interleaved with calls to arena_t::Purge, so it probably is indeed multiple allocations.
That would be less than 100MB/s
The memory could be paged out, or compressed, too. For memory which hasn't been touched in a long time, it seems odd to pay a cost during free() to page it all back into the cache.
Updated•3 years ago
|
Comment 7•2 years ago
|
||
FWIW, Chrome's PartitionAlloc doesn't seem to poison on free:
https://source.chromium.org/chromium/chromium/src/+/main:base/allocator/partition_allocator/partition_root.h;l=1331;drc=af8d5d2a6d2f816c1bf96481faae6349b92ae70d
i.e. they only unconditionally poison in debug builds and have some disabled support for poisoning 1% of allocations:
https://source.chromium.org/chromium/chromium/src/+/main:base/allocator/partition_allocator/partition_alloc_config.h;l=150;drc=af8d5d2a6d2f816c1bf96481faae6349b92ae70d
Comment 8•2 years ago
|
||
Chrome's PartitionAlloc doesn't seem to poison on free
We found security bugs in macOS betas with our poisoning allocator, which makes me suspect we're one of the only big packages to do that ;-)
| Assignee | ||
Updated•2 years ago
|
| Assignee | ||
Comment 9•2 years ago
|
||
Updated•2 years ago
|
Updated•2 years ago
|
| Assignee | ||
Comment 10•2 years ago
|
||
Performance results
When I test on my workstation I get a speedometer score of 91.7 +- 2.7, before it was 89.7 +- 2.2. Speedometer is rather variable, but that's an improvment of ~2%. However when we use try:
This doesn't improve speedometer, or does by a negligible amount. On my workstation if I limit Firefox to a smaller number of cores or disable turbo boost the performance improvement goes away. I believe that this is because this patch reduces time spent with the arena lock held, the main thread is now less likely to block and so without this change when it became blocked it was an opportunity for the CPU to slow its clockspeed or the OS to reschedule the thread onto a different CPU core, both are now less likely with the patch applied. This change in performance may be more likely to show up on laptops and desktops rather than the Xeons we use for performance testing.
In the link to the tests above there are many load and visual metrics tests that do see an improvement though, and others that see a regression (it is after all a scheduling change).
Updated•2 years ago
|
Comment 11•2 years ago
|
||
Comment 12•2 years ago
|
||
| bugherder | ||
Comment 13•2 years ago
|
||
Looks like it brought improvement to devtools tests:
== Change summary for alert #37704 (as of Sat, 18 Mar 2023 00:29:04 GMT) ==
Improvements:
| Ratio | Test | Platform | Options | Absolute values (old vs new) |
|---|---|---|---|---|
| 13% | reload-inspector:content-process memory | linux1804-64-tsan-qr | 109,501,098.67 -> 95,255,210.67 | |
| 5% | damp server.protocoljs.DAMP | windows10-64-shippable-qr | e10s fission stylo webrender | 1,353.39 -> 1,290.51 |
| 5% | damp server.protocoljs.DAMP | windows10-64-shippable-qr | e10s fission stylo webrender-sw | 1,355.33 -> 1,294.07 |
For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=37704
Updated•2 years ago
|
Description
•