Closed Bug 1920451 Opened 5 months ago Closed 7 days ago

Have heuristics to better predict when it is a good moment to purge

Categories

(Core :: Memory Allocator, enhancement)

enhancement

Tracking

()

RESOLVED FIXED
137 Branch
Tracking Status
firefox137 --- fixed

People

(Reporter: jstutte, Assigned: jstutte)

References

(Blocks 3 open bugs, Regressed 1 open bug)

Details

(Keywords: perf-alert, Whiteboard: [sp3])

Attachments

(3 files, 2 obsolete files)

Tests for bug 1903758 showed that most of the benefit of deferred purging comes from letting an arena re-use recently freed memory without ever giving it back to the OS.

We could introduce a concept of "last significant re-use" on each arena, that tells us how much time passed since we successfully re-used a page since the last purge request was issued, assuming that a recent re-use makes it more likely for other re-uses to come. It might be as easy as the last time that mNumDirty went down for re-use.

A similar concept could also be applied to chunk recycling.

Once we have that information, we can easily use this during moz_may_purge_one_now to determine if there is any arena (or the recycled chunk list) that needs purging (cleanup) by just iterating over the list of purge requests and comparing that timestamp with now, and (if requested) do the work only for those.

A first sketch how this could look like.

The basic idea is:

  • whenever we re-use more memory than the significance threshold we raise the bar of significance to that level of usage and update the timestamp
  • whenever we free memory, we reset the bar of significance to the new, lower level of usage but do not update the timestamp
  • when we may want to schedule an idle purge, we consider only purge requests that had no re-use for longer than a grace period

This should ensure that we do not count every up and down as a "significant reuse" but converge to a new average level that we want to purge for.

We do that per-arena based an mNumDirty as input for the threshold as well as for the recycled chunks list as a whole based on gRecycleSize.

Whiteboard: [sp3]

So I looked at memory reports from the regressing test, and it seems from the diff, that:

-7.95 MB (100.0%) -- heap
├──-15.70 MB (197.56%) -- decommitted
│  ├───-9.71 MB (122.17%) -- unused-pages
│  │   ├──-5.95 MB (74.83%) ── madvised [8]
│  │   └──-3.76 MB (47.35%) ── fresh [8]
│  └───-5.99 MB (75.39%) ── unmapped [8]
└───7.75 MB (-97.56%) -- committed
    ├──7.73 MB (-97.35%) ── unused-pages/dirty [8]   <<<<< !!!!!!
    └──0.02 MB (-00.22%) -- (3 tiny)
       ├──0.02 MB (-00.20%) ── bookkeeping [8]
       ├──0.01 MB (-00.11%) ── allocated [8]
       └──-0.01 MB (00.08%) ++ bin-unused

After lazy purge we see more committed unused-pages/dirty than without. The interesting thing is, that instead bin-unused is basically identical, so we do not see a higher fragmentation inside committed bin pages, AFAICT. I wonder, if purge was not able to coalesce enough dirty pages to make them become decommitted? Is there any magic constant we could tweak to be more aggressive about decommitting during purge ? I assume that arena_t::PurgeInfo::FindDirtyPages is somehow involved here ?

Note that in turn apparently we keep less fresh pages around, which kind of makes sense, given we want to re-use pages more often. Please note also that these measures are taken after a MozJemalloc::jemalloc_free_dirty_pages which is supposed to free ALL dirty pages, IIUC.

Flags: needinfo?(pbone)

Intreaguing...

FindDirtyPages() isn't changed by your patch and I can't imagine how it would be effected by your patch. It doesn't look for special pages or longer sets of pages. Its cleverness is limited to trying to do as little scanning as possible. For anyone else reading this who didn't see our conversation on slack. Dirty pages are the only kind of pages that it makes sense to purge, they're the only used page that is guaranteed to be resident. Fresh and madvised pages might be resident but we should assume they aren't since that's what OSs usually/historically do. The exception is madvised pages on MacOS, hence the double-purge logic.

I suspect, and based on the profiles I generated, that Purge just isn't being called as much. However that the test should call jemalloc_free_dirty_pages(), I don't see why it wouldn't work. Something is very strange with that perfhearder report. it is for "Base Content Resident Unique Memory Base Resident Unique Memory After tabs open [+30s, forced GC] opt fission" AFAIK it can't be both "Base" and "After tabs open"? When I click the link to go back to the main report I don't see it, but I find "Base Content Resident Unique Memory opt fission" with the same number, I think this is correct because 7-9MB fits "Base" and not "after tabs open", so I don't think jemalloc_free_dirty_pages() is getting called in that test.

Flags: needinfo?(pbone)

Thanks for taking a look. Yesterday evening I did a detailed investigation thanks to this pernosco profile and filed bug 1933648. tl;dr: There is no real regression, just noise from our measurements.

I think we can go forward with lazy purge (the patches might need some cleanup).

WRT the effect on page faults, :denispal made some Android profiles with the patch stack from lazy purge: Each profile ran through speedometer3 30 times. We see around a 30-40% reduction in page faults samples with lazy purge.

Base:
https://share.firefox.dev/3ZlED0X
https://share.firefox.dev/4923Uk5

Patches:
https://share.firefox.dev/4hX6hJ7
https://share.firefox.dev/3V6CRyr

Attachment #9427370 - Attachment description: WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. → WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?glandium,pbone,smaug
Assignee: nobody → jstutte
Attachment #9427370 - Attachment description: WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?glandium,pbone,smaug → Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?glandium,pbone,smaug
Status: NEW → ASSIGNED
Attachment #9427370 - Attachment description: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?glandium,pbone,smaug → WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?glandium,pbone,smaug
No longer blocks: 1903758
Depends on: 1903758
Attachment #9427370 - Attachment description: WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?glandium,pbone,smaug → Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?glandium,pbone,smaug
Attachment #9427370 - Attachment description: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?glandium,pbone,smaug → WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?glandium,pbone,smaug
Attachment #9427370 - Attachment description: WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?glandium,pbone,smaug → Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug
Attachment #9427370 - Attachment description: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug → WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug
Attachment #9427370 - Attachment description: WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug → Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug
Attachment #9427370 - Attachment description: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug → WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug
Attachment #9427370 - Attachment description: WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug → Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug
Attachment #9427370 - Attachment description: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug → WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug
Attachment #9427370 - Attachment description: WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug → Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug
Attachment #9427370 - Attachment description: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug → WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug
Attachment #9427370 - Attachment description: WIP: Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug → Bug 1920451 - Introduce a concept of last significant re-use to decide when to purge. r?pbone,smaug
Blocks: 1948205
See Also: → 1948601
Attachment #9465328 - Attachment is obsolete: true
Blocks: 1949087
Blocks: 1947687
Attachment #9466163 - Attachment is obsolete: true
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f9e3f8796616 Introduce a concept of last significant re-use to decide when to purge. r=pbone,smaug
Status: ASSIGNED → RESOLVED
Closed: 7 days ago
Resolution: --- → FIXED
Target Milestone: --- → 137 Branch

4% improvement on Ares6-Air-first-iteration

Regressions: 1949668

(In reply to Pulsebot from comment #10)

Pushed by jstutte@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f9e3f8796616
Introduce a concept of last significant re-use to decide when to purge.
r=pbone,smaug

Perfherder has detected a browsertime performance change from push f9e3f8796616bf24fd9d1c7260bc126fe531d0e0.

Improvements:

Ratio Test Platform Options Absolute values (old vs new) Performance Profiles
8% speedometer jQuery-TodoMVC/Adding100Items/Sync windows11-64-24h2-shippable fission webrender 35.72 -> 32.75
8% speedometer jQuery-TodoMVC/Adding100Items/Sync windows11-64-24h2-nightlyasrelease fission webrender 34.57 -> 31.71
8% speedometer jQuery-TodoMVC/Adding100Items windows11-64-24h2-shippable fission webrender 37.82 -> 34.81
8% speedometer jQuery-TodoMVC/Adding100Items windows11-64-24h2-nightlyasrelease fission webrender 36.61 -> 33.73
8% speedometer jQuery-TodoMVC/Adding100Items/Sync windows11-64-nightlyasrelease-qr fission webrender 33.98 -> 31.40
... ... ... ... ... ...
2% speedometer Inferno-TodoMVC windows11-64-shippable-qr fission webrender 56.54 -> 55.18

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests.

If you need the profiling jobs you can trigger them yourself from treeherder job view or ask a sheriff to do that for you.

You can run these tests on try with ./mach try perf --alert 44001

For more information on performance sheriffing please see our FAQ.

Keywords: perf-alert
Regressions: 1950558
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: