Closed Bug 1609478 Opened 5 years ago Closed 2 years ago

Avoid doing large memsets to poison values (or to 0, etc) under the memory lock

Categories

(Core :: Memory Allocator, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
113 Branch
Performance Impact medium
Tracking Status
firefox113 --- fixed

People

(Reporter: jesup, Assigned: pbone)

References

(Blocks 3 open bugs)

Details

(Keywords: perf:responsiveness, Whiteboard: [sp3])

Attachments

(1 file, 1 obsolete file)

We've seen profiles where the MainThread is blocked for 10's of ms on the memory lock, because JS Helper's GC-freeing code is poisoning large blocks of memory (see https://faraday.basschouten.com/mozilla/jsmemcpysmall.html)

At least when freeing large blocks of memory, the overhead cost of unlocking around the memset is in the noise. For small blocks it's probably not needed much and might be counter-productive (especially for very small), though if we can move anything out of the lock in general, that's always good.

See for example: https://perfht.ml/2u1IJ2h
(from https://docs.google.com/spreadsheets/d/1v0w5OZy3MQxRVQj-557wqdV7FzehzIz0AaqRHMaoPFE/edit#gid=75257179)

Whiteboard: [qf] → [qf:p1:responsiveness]

Do you have a link to a profile? The one in comment 0 doesn't work.

Now, without a profile, I'll say this: the claim that "the MainThread is blocked for 10's of ms on the memory lock" is dubious. I can believe that's what it could look like in a sampled profile if many large allocations are freed in a succession on the main thread because of the likeliness to get a sample during memset.

But considering a large allocation is at most 1036288 bytes, spending > 10ms on a memset for that amount of memory sounds ridiculous. That would be less than 100MB/s. SHA1 is much more complex than memset and can do > 500MB/s.

I'm not saying we shouldn't take the patch (but I want to carefully think about the side effects), but I'm saying the premise seems bogus, and the profiles should probably be looked at more deeply.

Assignee: rjesup → nobody
Assignee: nobody → rjesup
Status: NEW → ASSIGNED
Performance Impact: --- → P1
Whiteboard: [qf:p1:responsiveness]

I'm going to reduce the priority on this bug until we have better proof of this being a problem. Speaking from my perspective, I remember seeing this issue show up in profiles every now and then but not very frequently.

We can increase the priority again when we've found more profiles that show this problem.

Performance Impact: P1 → P2

(In reply to Mike Hommey [:glandium] from comment #2)

Do you have a link to a profile? The one in comment 0 doesn't work.

It does work for me, fwiw. Here's the thread that contains the memset: https://share.firefox.dev/3c5uejI
On that thread you can see that the memset is interleaved with calls to arena_t::Purge, so it probably is indeed multiple allocations.

That would be less than 100MB/s

The memory could be paged out, or compressed, too. For memory which hasn't been touched in a long time, it seems odd to pay a cost during free() to page it all back into the cache.

Severity: normal → S3
See Also: → 1809610

Chrome's PartitionAlloc doesn't seem to poison on free

We found security bugs in macOS betas with our poisoning allocator, which makes me suspect we're one of the only big packages to do that ;-)

Blocks: 1809610
OS: Unspecified → All
Priority: -- → P1
Hardware: Unspecified → All
Whiteboard: [sp3]

Performance results

When I test on my workstation I get a speedometer score of 91.7 +- 2.7, before it was 89.7 +- 2.2. Speedometer is rather variable, but that's an improvment of ~2%. However when we use try:

https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=058ad9782751284e6af9b4a9103db1d6003b5594&newProject=try&newRevision=f228b80a83a528a838f46683ddd3edb726a4ee39&framework=13&page=1&showOnlyConfident=1&showOnlyImportant=1

This doesn't improve speedometer, or does by a negligible amount. On my workstation if I limit Firefox to a smaller number of cores or disable turbo boost the performance improvement goes away. I believe that this is because this patch reduces time spent with the arena lock held, the main thread is now less likely to block and so without this change when it became blocked it was an opportunity for the CPU to slow its clockspeed or the OS to reschedule the thread onto a different CPU core, both are now less likely with the patch applied. This change in performance may be more likely to show up on laptops and desktops rather than the Xeons we use for performance testing.

In the link to the tests above there are many load and visual metrics tests that do see an improvement though, and others that see a regression (it is after all a scheduling change).

Attachment #9121099 - Attachment is obsolete: true
Pushed by pbone@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9264c1bf87fa Poison memory we want to free before we take the arena lock r=glandium
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 113 Branch

Looks like it brought improvement to devtools tests:

== Change summary for alert #37704 (as of Sat, 18 Mar 2023 00:29:04 GMT) ==

Improvements:

Ratio Test Platform Options Absolute values (old vs new)
13% reload-inspector:content-process memory linux1804-64-tsan-qr 109,501,098.67 -> 95,255,210.67
5% damp server.protocoljs.DAMP windows10-64-shippable-qr e10s fission stylo webrender 1,353.39 -> 1,290.51
5% damp server.protocoljs.DAMP windows10-64-shippable-qr e10s fission stylo webrender-sw 1,355.33 -> 1,294.07

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=37704

Regressions: 1825837
Assignee: rjesup → pbone
Blocks: 1811985
Blocks: 1811784
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: