Closed Bug 1874022 Opened 9 months ago Closed 4 months ago

The PHC check has too much overhead on arm64

Categories

(Core :: Memory Allocator, enhancement, P1)

ARM64
Unspecified
enhancement

Tracking

()

RESOLVED FIXED
129 Branch
Tracking Status
firefox129 --- fixed

People

(Reporter: mstange, Assigned: pbone)

References

(Blocks 1 open bug)

Details

Attachments

(14 files, 3 obsolete files)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review

Profile: https://share.firefox.dev/4aOcGCP

https://speedometer-preview.netlify.app/?suite=TodoMVC-Lit-Complex-DOM&iterationCount=100 is spending too much time in MaybePageAlloc, specifically in this check:

https://searchfox.org/mozilla-central/rev/b1a029fadaaabb333d8139f9ec3924a20c0c941f/memory/build/PHC.cpp#1165-1168

int32_t newDelay = GAtomic::DecrementDelay();
if (newDelay != 0) {
  return nullptr;
}

I think the atomic should be revisited. In particular, I don't believe this sentence:

https://searchfox.org/mozilla-central/rev/b1a029fadaaabb333d8139f9ec3924a20c0c941f/memory/build/PHC.cpp#1146-1148

// It's critical that sAllocDelay has ReleaseAcquire semantics, because that
// guarantees that exactly one thread will see sAllocDelay have the value 0.
// (Relaxed semantics wouldn't guarantee that.)

Relaxed atomics are still atomic.

Thanks Markus.

I think we could use thread local variables here rather than atomics.

This bug is almost a duplicate of 1833807, but I'm going to make it "depends on" because once we do that bug we should re-test arm64.

Depends on: 1833807
Priority: -- → P2
Hardware: Unspecified → ARM64
See Also: → 1884293

Working on this now.

Assignee: nobody → pbone
Status: NEW → ASSIGNED
Priority: P2 → P1

Currently we have 3 structures containing mutable data seperated by their
lock-safe behaviour (locked, atomic and TLS). I'd prefer to associate
structures by what they mean. This patch moves the tlsIsDisabled
variable, the only one in GTls into GMutable and removes the GTls
class.

This also means that we can remove the logic that checks if it has wrapped
if PHC was disabled on the current thread since that won't happen due to
other threads. The patch also adds code that has to initialise the delay
on each thread.

Testing the patches first on x86_64 (Ryzen 4 Threadripper) I got the following results for the subtest above (https://speedometer-preview.netlify.app/?suite=TodoMVC-Lit-Complex-DOM&iterationCount=100)

Central: 28.1 +- 0.32
Tip: 28.5 +- 0.36
Disabled: 27.8 +- 0.32

Maybe smaller numbers are better in which case my patch makes this slower?

But looking at the profiles, the overhead of this code goes from 2.2% to 1.5%, so either my patch makes this better or it hits a bottleneck elsewhere.

Central: https://share.firefox.dev/3U4JY9l
Tip: https://share.firefox.dev/4aDsQ0X

I'll run tests on my Mac Mini M1 soon.

On my Mac M1

Central: 38.7 +- 0.40
Tip: 39.2 +- 0.55
PHC disabled: 39.8 +- 0.46

So my patch helps.

Profiles:
Central: https://share.firefox.dev/49n79kA
Tip: https://share.firefox.dev/3vCs45w

it seems to drop from 1.2% to 0.6% here (you have to find pthread_getspecific and pthread_setspecific)

I also noticed that in my "central" profile the branch outside DecrementDelay() showed up with a moderate number of samples, but in the new profile it had zero. So this has improved the cost of MaybePageAlloc(), not only DecrementDelay()

On ARM the atomic used for Now() didn't show up at all. it is Relaxed.

Attachment #9394705 - Attachment description: WIP: Bug 1874022 - Remove the GTls structure → Bug 1874022 - pt 1. Remove the GTls structure r=glandium
Attachment #9394706 - Attachment description: WIP: Bug 1874022 - Move sNow into GMut → Bug 1874022 - pt 2. Move sNow into GMut r=glandium
Attachment #9394707 - Attachment description: WIP: Bug 1874022 - Move sAllocDelay into GMut → Bug 1874022 - pt 3. Move sAllocDelay into GMut r=glandium
Attachment #9394708 - Attachment description: WIP: Bug 1874022 - Make the allocation delay thread-local → Bug 1874022 - pt 4. Make the allocation delay thread-local r=glandium

Rename this structure to PHCRegion and move its instance into PHC.

Attachment #9394708 - Attachment description: Bug 1874022 - pt 4. Make the allocation delay thread-local r=glandium → WIP: Bug 1874022 - pt 9. Make the allocation delay thread-local
Attachment #9397067 - Attachment description: WIP: Bug 1874022 - pt 6. Rewrite maybe_init() to only require a single memory access → WIP: Bug 1874022 - pt 10. Rewrite maybe_init() to only require a single memory access

This patch uses the wrong approach

Attachment #9397068 - Attachment description: WIP: Bug 1874022 - pt 7. Inline the hot path from MaybePageAlloc → WIP: Bug 1874022 - pt 12. Inline the hot path from MaybePageAlloc
Attachment #9397069 - Attachment description: WIP: Bug 1874022 - pt 8. Reset the alloc delay even when PHC is disabled → WIP: Bug 1874022 - pt 13. Reset the alloc delay even when PHC is disabled
Attachment #9397070 - Attachment description: WIP: Bug 1874022 - pt 9. Seperate arena and non-arena code paths in PHC → WIP: Bug 1874022 - pt 14. Seperate arena and non-arena code paths in PHC
Attachment #9401886 - Attachment description: WIP: Bug 1874022 - pt 4. Rename the GMut structure and move gMut to a static member → Bug 1874022 - pt 4. Rename the GMut structure and move gMut to a static member r=glandium
Attachment #9401888 - Attachment description: WIP: Bug 1874022 - pt 6. Rename the GConst structure and move gConst → Bug 1874022 - pt 6. Rename the GConst structure and move gConst r=glandium
Attachment #9401889 - Attachment description: WIP: Bug 1874022 - pt 7. Move the mutex to a member variable → Bug 1874022 - pt 7. Move the mutex to a member variable r=glandium
Attachment #9401890 - Attachment description: WIP: Bug 1874022 - pt 8. Add a proof of lock argument to a function → Bug 1874022 - pt 8. Add a proof of lock argument to a function r=glandium
Attachment #9397066 - Attachment is obsolete: true
Attachment #9394708 - Attachment description: WIP: Bug 1874022 - pt 9. Make the allocation delay thread-local → Bug 1874022 - pt 9. Make the allocation delay thread-local r=glandium
Attachment #9397067 - Attachment description: WIP: Bug 1874022 - pt 10. Rewrite maybe_init() to only require a single memory access → Bug 1874022 - pt 10. maybe_init() uses a single memory access r=glandium
Attachment #9401891 - Attachment description: WIP: Bug 1874022 - pt 11. WIP avoid falsesharing WIP → Bug 1874022 - pt 11. Improve cache locality r=glandium
Duplicate of this bug: 1833807
Attachment #9401890 - Attachment description: Bug 1874022 - pt 8. Add a proof of lock argument to a function r=glandium → Bug 1874022 - pt 8. Add a proof of lock argument to ResetRNG r=glandium
Blocks: PHC2023
Attachment #9394708 - Attachment description: Bug 1874022 - pt 9. Make the allocation delay thread-local r=glandium → Bug 1874022 - pt 9. Reduce the cost of updating sAllocDelay r=glandium
Attachment #9397069 - Attachment is obsolete: true
Attachment #9397068 - Attachment description: WIP: Bug 1874022 - pt 12. Inline the hot path from MaybePageAlloc → Bug 1874022 - pt 12. Inline the hot path from MaybePageAlloc r=glandium
Attachment #9397070 - Attachment description: WIP: Bug 1874022 - pt 14. Seperate arena and non-arena code paths in PHC → Bug 1874022 - pt 13. Seperate arena and non-arena code paths in PHC r=glandium
Attachment #9397070 - Attachment is obsolete: true
Attachment #9397070 - Attachment is obsolete: false
Attachment #9397070 - Attachment description: Bug 1874022 - pt 13. Seperate arena and non-arena code paths in PHC r=glandium → Bug 1874022 - pt 14. Seperate arena and non-arena code paths in PHC r=glandium
Pushed by pbone@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1f2e2a375881 pt 1. Remove the GTls structure r=glandium https://hg.mozilla.org/integration/autoland/rev/ecae8e56c5cc pt 2. Move sNow into GMut r=glandium https://hg.mozilla.org/integration/autoland/rev/d1f91fdc7c68 pt 3. Move sAllocDelay into GMut r=glandium https://hg.mozilla.org/integration/autoland/rev/6248406a1453 pt 4. Rename the GMut structure and move gMut to a static member r=glandium https://hg.mozilla.org/integration/autoland/rev/2092fb5bf23b pt 5. Fix some outdated comments r=glandium https://hg.mozilla.org/integration/autoland/rev/da3d695e2d65 pt 6. Rename the GConst structure and move gConst r=glandium https://hg.mozilla.org/integration/autoland/rev/6c5254f3df9f pt 7. Move the mutex to a member variable r=glandium https://hg.mozilla.org/integration/autoland/rev/d6539e603aa5 pt 8. Add a proof of lock argument to ResetRNG r=glandium https://hg.mozilla.org/integration/autoland/rev/41762a317aae pt 9. Reduce the cost of updating sAllocDelay r=glandium https://hg.mozilla.org/integration/autoland/rev/6ace59776780 pt 10. maybe_init() uses a single memory access r=glandium https://hg.mozilla.org/integration/autoland/rev/dc512f63eaa6 pt 11. Improve cache locality r=glandium https://hg.mozilla.org/integration/autoland/rev/2a81039219c9 pt 12. Inline the hot path from MaybePageAlloc r=glandium https://hg.mozilla.org/integration/autoland/rev/fa1129aa4279 pt 13. Inline the hot path for free r=glandium https://hg.mozilla.org/integration/autoland/rev/f72b2832f50f pt 14. Seperate arena and non-arena code paths in PHC r=glandium
Attachment #9409332 - Attachment is obsolete: true
Blocks: 1905805
Blocks: 1885102
Regressions: 1906409
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: