Closed Bug 1669392 Opened 4 years ago Closed 3 years ago

Add more size classes between 512B and the page size

Categories

(Core :: Memory Allocator, enhancement, P2)


Tracking


RESOLVED FIXED
95 Branch
Fission Milestone Future
Tracking Status
firefox-esr78 --- wontfix
firefox-esr91 --- wontfix
firefox92 --- wontfix
firefox93 --- wontfix
firefox94 --- wontfix
firefox95 --- fixed

People

(Reporter: pbone, Assigned: pbone)

References

(Blocks 1 open bug)

Details

(Keywords: memory-footprint, Whiteboard: [MemShrink] fission-memory [fission:m95])

Attachments

(4 files, 3 obsolete files)

In a normal configuration (4KiB pages) jemalloc uses power-of-two size classes between 512 bytes and 4KiB. Thanks to Bug 1640309, for a process loading example.com we can see:

bin    slop    used     percent
496       136     9920       1%
512       528    98304       1%
1024   156226   634880      25%
2048   178876   643072      28%
large  385725  3338240      12%

The table shows slop as a percentage of used memory for each size class. In the 1024, 2048 and (not shown) 4096 classes the slop is much higher.

(Comment edited to mark up the table properly)

I tried a few different size-class spacings; a size class every 256 bytes is the best. But I think we can do even better than this by reconsidering all the size classes below 4KiB (I'm happy to do that in a follow-up bug). I measured this with the memory replay tool.

Num bins  Quantum   Allocated (KB)  Waste (KB)  Dirty (KB)  Bookkeep (KB)  Committed (KB)  Bin unused (KB)  Slop (KB)  Unused+slop (KB)  Unused+slop diff  Committed diff
3         pow-of-2  7,317           516         592         178            9,156           553              847        1,400              0.0%              0.0%
32        128       6,841           516         500         184            8,964           923              371        1,294              7.6%              2.1%
16        256       6,892           516         436         180            8,756           732              422        1,154             17.6%              4.4%
8         512       7,060           516         560         184            8,992           671              590        1,261              9.9%              1.8%

The first row is the current configuration of using power-of-2 bin sizes in this range (1024, 2048, 4096).
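For readers unfamiliar with slop, here is a minimal sketch of the rounding being compared; the function names are ours and the logic is illustrative, not mozjemalloc's actual implementation:

#include <cstddef>
#include <cstdio>

// Old scheme: power-of-two classes between 512B and 4KiB.
static size_t RoundPow2(size_t aSize) {
  size_t cls = 512;
  while (cls < aSize) {
    cls <<= 1;
  }
  return cls;
}

// Proposed scheme: a class every 256 bytes in the same range.
static size_t Round256(size_t aSize) {
  return (aSize + 255) & ~size_t(255);
}

int main() {
  // An 1100-byte request rounds to 2048 today (948 bytes of slop) but
  // would round to 1280 with 256-byte spacing (180 bytes of slop).
  size_t request = 1100;
  printf("pow2: %zu (slop %zu)\n", RoundPow2(request),
         RoundPow2(request) - request);
  printf("256B: %zu (slop %zu)\n", Round256(request),
         Round256(request) - request);
  return 0;
}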

Are you taking into account the fact that for one allocation of each size, the minimum size actually used becomes the page size? Counterintuitively, this could have the exact opposite of the intended effect.

It might be better to see why we're allocating these odd sizes between 512B and 4K in the first place.

Summary: Add more size classes between 512B and 4KiB → Add more size classes between 512B and the page size

Also, this has the potential to increase fragmentation substantially (and thus RSS).

The fact that the "Allocated" numbers vary so much in comment 1 makes me doubt the entire table. It would also be necessary to see the full picture with more sites, after loading/unloading, multiple tabs, etc. An AWSY run would be a good start for that, but until those per-bin stats are exposed to about:memory, AWSY is not going to give that information... I'd suggest doing a local AWSY run with logging enabled and replaying those logs.

(In reply to Mike Hommey [:glandium] from comment #2)

Are you taking into account the fact that for one allocation of each size, the minimum size actually used becomes the page size? Counter intuitively, this could have the complete reverse effect than expected.

Right, for a process with very few live allocations this would make things worse. We need to balance slop with bin-unused, which is why I wanted to measure both.
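A worked example of the failure mode glandium describes, with illustrative numbers (a sketch, not mozjemalloc's actual accounting):

#include <cstdio>

int main() {
  // Illustrative numbers only: one 768B and one 1024B live allocation.
  const int kRun = 4096;  // each bin commits at least one run (page)

  // Old scheme: both requests share the 1024-byte bin -> one run.
  int oldCommitted = kRun;      // 4096
  int oldSlop = 1024 - 768;     // 256

  // New scheme: separate 768B and 1024B bins -> two runs committed.
  int newCommitted = 2 * kRun;  // 8192
  int newSlop = 0;

  printf("old: committed %d, slop %d\n", oldCommitted, oldSlop);
  printf("new: committed %d, slop %d\n", newCommitted, newSlop);
  // Saving 256 bytes of slop costs a whole extra committed page here;
  // the extra classes only pay off when each bin is well-populated.
  return 0;
}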

It might be better to see why we're allocating these odd > 512 < 4K sizes.

I've had a look at slop in DMD and found one thing I've fixed and another that I'm investigating (Bug 1662345); the remaining ones are allocations that are dynamically sized anyway, like JS script code (bytecode, I think), rather than constant-sized allocations. However:

  • I should filter for this range when looking at the DMD output.
  • I should try a larger process, I've been "optimising for" small processes since that's the case we're interested in for Fission, but I can look at larger ones too.

(In reply to Mike Hommey [:glandium] from comment #3)

Also, this has the potential to increase fragmentation substantially (and thus RSS).

Good point. I'll test with longer-running processes that do a few navigations.

Apologies, I want to break up your paragraph to make my reply clearer.

(In reply to Mike Hommey [:glandium] from comment #4)

The fact that the "Allocated" numbers vary so much in comment 1 makes me doubt the entire table.

I generated this table by replaying the same log file of allocations.

Allocated can vary since it measures the full cell size, not the requested size. From the table in comment 0, the sum of the slop column is at least 327KiB, and that doesn't include the allocations that get rounded up to 4096, since I excluded that row. So I think Allocated can vary that much.

It would also be necessary to see the full picture with more sites, after loading/unloading, multiple tabs, etc. An AWSY run would be a good start for that, but until those per-bin stats are exposed to about:memory, AWSY is not going to give that information... I'd suggest doing a local AWSY run with logging enabled and replay those logs.

Yes, I haven't tested what happens for a longer-lived process. Running AWSY locally is a good idea. Thanks.

I ran AWSY in try here:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b37bbae0a0c0c017f4eceabee3b8e52132d8b04c&newProject=try&newRevision=01a730fb4d3d464566f1410effee400919778190&framework=4

My next step is to tidy up the patch so you can see it.

BTW, considering Apple Silicon Macs are going to have 16KB pages, you'll also want to measure how this goes with 16KB pages (you can fake that by setting the static page size to 16KB on whatever OS you're using).
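For anyone reproducing this, a hedged sketch of what a compile-time ("static") page size looks like; the identifiers are hypothetical stand-ins, not mozjemalloc's actual names:

#include <cstddef>

// Hypothetical constants for an allocator's static page size. Changing
// the shift from 12 to 14 makes every run, bin and chunk calculation
// behave as if the OS had 16KiB pages.
static const size_t kPageSize2Pow = 14;  // 12 for 4KiB, 14 for 16KiB
static const size_t kPageSize = size_t(1) << kPageSize2Pow;
static const size_t kPageMask = kPageSize - 1;

// Example derived value: round an allocation size up to whole pages.
static inline size_t PageCeil(size_t aSize) {
  return (aSize + kPageMask) & ~kPageMask;
}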

Blocks: 1640309
No longer blocks: memshrink-content
URL: 1656155
See Also: 1640309
Depends on: 1670188
See Also: → 1656155

My list of things to test with this change that haven't been tested/answered yet is:

  • Test on long-running processes:
      ◦ Try processes that do a new pageload for each navigation (eg, browsing a news site/reddit).
      ◦ Try single-page apps.
      ◦ JS games / something that does a lot of processing in JS? (I'm thinking of object churn, but it may be more small objects here.)
  • Test on large processes:
      ◦ facebook?
      ◦ google docs (many tabs for different documents?)
      ◦ Some kind of large document? A very detailed SVG file?
  • Test with 16KiB pages.

Any other ideas, or sites with usage patterns we've thought of in the past for these cases?

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(continuation)

I'm not sure. Long running processes (beyond AWSY) is something we've never tested very well.

Flags: needinfo?(continuation)

My patch had a limitation preventing 16KiB and other page sizes from working. With that fixed, the results look like this:

I recorded a process browsing wikipedia, following several links until the gzipped log file was 100MB, then captured a memory report and killed the browser. Here is the amount of committed memory for each configuration (I have the other results too, but committed is the fairest comparison):

                4KiB pages  16KiB pages
Without patch:   68,732KiB    83,312KiB
With patch:      67,004KiB    80,320KiB

The patch is a 3% win in this test with 4KiB pages and a 4% win with 16KiB pages.

Just switching from 4KiB pages to 16KiB pages is a memory regression of 21%.

With more testing across different sizes and lifetimes of processes, we have jemalloc committed memory for each process (MiB):

            Before  After   Improvement
Wikipedia    67.12   65.43    2.51%
Facebook    492.13  486.31    1.18%
example.com   8.85    8.57    3.18%
slack       113.66  110.94    2.40%
google      545.00  530.98    2.57%
parent      257.27  253.37    1.52%
prealloc      6.39    5.86    8.31%
socket        2.16    2.25   -4.15%
privileged   22.41   22.41    0.02%

Mean                          1.95%
Mean (excluding singletons)   3.36%
  • Wikipedia: Follow about 10 links to different articles, each new article causes a new pageload.
  • Facebook: Scroll the news feed, react to some posts, leave idle for some time, reload the news feed and scroll & react again.
  • example.com: Load the page
  • slack: Read "All Unread", clicking on some threads to read them.
  • google: Label and archive some e-mails and open 3 google docs, edit one of these docs.

The total amount of reduced memory (even though 'Facebook' was captured from a different browser session) is 1.91%. The average saving per process is 1.95%. If we exclude the "singleton" processes, those that the browser has only one of, like socket, main and privilegedabout, then the average is 3.36%.

I'm confident that this is a clear win for memory saving.

Flags: needinfo?(mh+mozilla)

Can you check actual RSS rather than committed?

Flags: needinfo?(pbone)
Fission Milestone: --- → M7

(In reply to Mike Hommey [:glandium] from comment #18)

Can you check actual RSS rather than committed?

Here is the same comparison using the RSS of the logalloc-replay tool, excluding logalloc-replay's own dynamic memory.

             Before (MiB)  After (MiB)  Delta (MiB)  Percent
Wikipedia        69.82        67.77        2.05       2.94%
Facebook        494.34       488.88        5.46       1.10%
example.com      11.52        11.28        0.24       2.10%
slack           117.18       113.46        3.72       3.17%
google          541.25       526.38       14.87       2.75%
parent          259.61       254.79        4.82       1.86%
prealloc          9.07         8.53        0.54       5.99%
socket            4.27         4.25        0.02       0.37%
privileged       25.11        25.08        0.03       0.11%

Mean                                                  2.26%
Mean (excluding singletons)                           3.01%
Flags: needinfo?(pbone)

I'm kind of surprised you're getting such large RSSes at all, considering (and I had forgotten) that logalloc-replay doesn't memset() the allocated memory (since bug 1423000), so allocated memory is never actually committed unless you enable zero or junk.
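The effect glandium describes is easy to demonstrate standalone. A minimal sketch, assuming a Linux-style overcommitting kernel (this is not logalloc-replay's code):

#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
  const size_t kSize = 64 * 1024 * 1024;
  char* p = static_cast<char*>(malloc(kSize));
  if (!p) {
    return 1;
  }
  // At this point RSS has barely moved: the kernel hands out virtual
  // address space and only commits pages on first write. A replay tool
  // that never touches its allocations therefore shows very little RSS.
  memset(p, 0xA5, kSize);  // first touch: RSS now grows by ~64MiB
  printf("%d\n", p[0]);
  free(p);
  return 0;
}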

Flags: needinfo?(pbone)
Severity: -- → N/A
Fission Milestone: M7 → MVP
Whiteboard: [MemShrink] → [MemShrink] fission-memory

We need Bug 1671114 to measure the benefit of this change.

Depends on: 1671114
Attachment #9181196 - Attachment is obsolete: true
Attachment #9180104 - Attachment description: Bug 1669392 - pt 1. Add more jemalloc size classes r=glandium → Bug 1669392 - Add more jemalloc size classes r=glandium
Depends on: 1713271

Here's the updated RSS data:

           RSS before (KB)  RSS after (KB)  Delta (KB)  Percent
example          8,508           8,220          288       3.39%
extension       15,396          15,524         -128      -0.83%
Fb-parent      231,156         220,344       10,812       4.68%
Fb             503,512         497,400        6,112       1.21%
google         557,944         543,100       14,844       2.66%
parent         263,788         259,032        4,756       1.80%
prealloc         5,964           5,428          536       8.99%
slack          116,624         112,908        3,716       3.19%
socket           1,624           1,544           80       4.93%
wiki            68,244          66,004        2,240       3.28%

This is almost always a benefit, sometimes saving up to 5% of a process's memory usage.

Flags: needinfo?(pbone)

The next thing to test is a long browser session (eg 4 hours) to see if this negatively impacts fragmentation drastically.

I have tested a longer browser session (typical evening Firefox usage for me) to check that there's no regression with these changes due to fragmentation, and none of the processes showed any regression in RSS or bin-unused+slop (not shown).

Here are the results from memory replay at the end of the session for each of the processes (RSS in KB; "m-c" is an unpatched mozilla-central build):

          RSS (m-c)   RSS (patches)  Delta   Percent
parent    1,589,548     1,574,680    14,868   0.94%
discord     140,788       137,112     3,676   2.61%
facebook    103,804       102,980       824   0.79%
github       75,960        69,660     6,300   8.29%
todoist      63,336        61,832     1,504   2.37%
twitter     151,728       146,480     5,248   3.46%
youtube     377,444       374,280     3,164   0.84%
total     2,502,608     2,467,024    35,584   1.42%

After pressing "Minimise memory usage" at the end of the session, extra usage due to fragmentation would have shown up here if anywhere:

          RSS (m-c)   RSS (patches)  Delta   Percent
parent    1,015,704       999,272    16,432   1.62%
discord     113,580       111,100     2,480   2.18%
facebook     94,564        93,784       780   0.82%
github       46,604        45,196     1,408   3.02%
twitter     143,912       139,444     4,468   3.10%
youtube     367,224       363,912     3,312   0.90%

I ran some AWSY tests:

https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=9122dd221e6801f1db419147af8d981b71829b31&newProject=try&newRevision=68ab6bb4f439f4751497ba695c6a1514d2d65716&framework=4

Explicit base content memory has increased, but that's the trade-off these patches make: they increase fragmentation in order to decrease slop, while reducing resident memory overall. We can verify that this is fragmentation because the memory reports for base explicit memory show an increase in bin-unused.

Some of the tests show a regression for resident memory, such as on Windows, but viewing their subtests (https://treeherder.mozilla.org/perfherder/comparesubtest?originalProject=try&newProject=try&newRevision=68ab6bb4f439f4751497ba695c6a1514d2d65716&originalSignature=2240017&newSignature=2240017&framework=4&originalRevision=9122dd221e6801f1db419147af8d981b71829b31) shows that the "tabs closed" subtest is bringing the score down. Although that's a big difference, after the forced GC the regression disappears and becomes a win again. In most of the other tests the regression disappears without the forced GC (only needing the 30 seconds). This may mean that the browser doesn't return memory quickly after closing tabs, but on the whole it uses less memory. This is also a symptom of fragmentation, since a single allocation can keep a chunk allocated.

The case I wanted to optimise is a content process running example.com, but AWSY doesn't test this. I tested it above with logalloc-replay and it showed a 288KB improvement, but when I test by comparing memory reports it's 100KB worse. That's not the result I'd hoped for; still, there's a lot of improvement for larger content processes.

Any further thoughts/reviews on this, glandium?

Thanks.

Flags: needinfo?(mh+mozilla)

This bug is a soft blocker for Fission MVP. We'd like to fix it before our Release channel rollout, but we won't delay the rollout waiting for it.

Whiteboard: [MemShrink] fission-memory → [MemShrink] fission-memory fission-soft-blocker
Flags: needinfo?(mh+mozilla)

Revert the SubPage size class to its original power-of-two sizing and make
the 512B-4KiB range a second Quantum-spaced size class.

All the ranges defined for the size classes are now inclusive of their upper
bound, to make them consistent.

Depends on D92729
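To make the resulting layout concrete, here is a rough sketch of the class ranges the commit message describes, assuming 16KiB pages; the enum and thresholds approximate mozjemalloc's configuration rather than quoting it:

#include <cstddef>

enum class ClassKind { Tiny, Quantum, QuantumWide, SubPage, Large };

// Upper bounds are inclusive, matching the commit message.
ClassKind ClassifySize(size_t aSize, size_t aPageSize) {
  if (aSize <= 8) return ClassKind::Tiny;             // 2, 4, 8
  if (aSize <= 512) return ClassKind::Quantum;        // 16-byte spacing
  if (aSize <= 4096) return ClassKind::QuantumWide;   // 256-byte spacing (this bug)
  if (aSize <= aPageSize) return ClassKind::SubPage;  // powers of two again
  return ClassKind::Large;
}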

Attachment #9240222 - Attachment is obsolete: true

Setting status-firefox94=wontfix. Since the Nightly 94 code freeze is this week, Paul plans to wait until Nightly 95 to land these malloc changes.

Fission Milestone: MVP → Future
Priority: P1 → P2
Whiteboard: [MemShrink] fission-memory fission-soft-blocker → [MemShrink] fission-memory [fission:m95]

jemalloc_stats takes an array as its second argument. It expects this
array to have enough space for all the bins; previously the maximum was set
as a magic number. To make it dependent on the configured bins, this patch
replaces the compile-time constant with a function.

Depends on D92729
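A sketch of the calling pattern this enables; jemalloc_stats_num_bins() is the function the patch adds, and the surrounding signatures are approximated from memory, so treat this as illustrative rather than exact:

#include <vector>

#include "mozmemory.h"  // jemalloc_stats_t, jemalloc_bin_stats_t

void CollectStats() {
  jemalloc_stats_t stats;
  // Size the per-bin array from a runtime query instead of a
  // compile-time magic maximum.
  std::vector<jemalloc_bin_stats_t> bins(jemalloc_stats_num_bins());
  jemalloc_stats_internal(&stats, bins.data());
  // ... consume stats and bins ...
}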

Blocks: 1735250

Comment on attachment 9244690 [details]
Bug 1669392 - Provide a less-magic array size for jemalloc_stats r=glandium

Revision D127761 was moved to bug 1735250. Setting attachment 9244690 [details] to obsolete.

Attachment #9244690 - Attachment is obsolete: true
Pushed by pbone@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9de9bd47c061
Add more jemalloc size classes r=glandium

Backed out changeset 9de9bd47c061 (Bug 1669392) for causing build bustages.

Flags: needinfo?(pbone)

Oh, I need to move some code between patches.

Flags: needinfo?(pbone)
Pushed by pbone@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8679a50bd45a
Add more jemalloc size classes r=glandium
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 95 Branch
Blocks: 1735715

== Change summary for alert #31892 (as of Fri, 15 Oct 2021 09:51:13 GMT) ==

Improvements:

Ratio  Test                                               Platform                    Options                       Absolute values (old vs new)
14%    perf_reftest_singletons link-style-cache-1.html    macosx1014-64-shippable-qr  e10s fission stylo webrender  1,026.34 -> 879.62
13%    perf_reftest_singletons link-style-cache-1.html    macosx1014-64-shippable-qr  e10s stylo webrender          1,020.05 -> 886.26
8%     perf_reftest_singletons link-style-cache-1.html    linux1804-64-shippable-qr   e10s fission stylo webrender  472.13 -> 434.69
7%     perf_reftest_singletons inline-style-cache-1.html  macosx1014-64-shippable-qr  e10s stylo webrender          1,721.18 -> 1,599.22
7%     perf_reftest_singletons inline-style-cache-1.html  macosx1014-64-shippable-qr  e10s fission stylo webrender  1,714.59 -> 1,601.56
6%     perf_reftest_singletons link-style-cache-1.html    linux1804-64-shippable-qr   e10s fission stylo webrender  475.06 -> 446.11

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=31892

Blocks: 1738240
Regressions: 1735482
