Closed Bug 1669392 Opened 4 years ago Closed 3 years ago

Add more size classes between 512B and the page size

Categories

(Core :: Memory Allocator, enhancement, P2)


Tracking


RESOLVED FIXED
95 Branch
Fission Milestone Future
Tracking Status
firefox-esr78 --- wontfix
firefox-esr91 --- wontfix
firefox92 --- wontfix
firefox93 --- wontfix
firefox94 --- wontfix
firefox95 --- fixed

People

(Reporter: pbone, Assigned: pbone)

References

(Blocks 1 open bug)

Details

(Keywords: memory-footprint, Whiteboard: [MemShrink] fission-memory [fission:m95])

Attachments

(4 files, 3 obsolete files)

In a normal configuration (4KiB pages) jemalloc uses power-of-two size classes between 512 bytes and 4KiB. Thanks to Bug 1640309, for a process loading example.com we can see:

bin    slop    used     percent
496       136     9920       1%
512       528    98304       1%
1024   156226   634880      25%
2048   178876   643072      28%
large  385725  3338240      12%

The table shows slop as a percentage of used memory for each size class. In the 1024, 2048 and (not shown) 4096 classes the slop is much higher.

(Comment edited to mark up the table properly)

I tried a few different size-class spacings; a size class every 256 bytes is the best. But I think we can do even better than this by reconsidering all the size classes below 4KiB (I'm happy to do that in a follow-up bug). I measured this with the memory replay tool.

Num bins  Quantum   Allocated (KB)  Waste (KB)  Dirty (KB)  Bookkeep (KB)  Committed (KB)  Bin unused (KB)  Slop (KB)  Unused+slop (KB)  Unused+slop diff  Committed diff
3         pow-of-2  7,317           516         592         178            9,156           553              847        1,400              0.0%              0.0%
32        128       6,841           516         500         184            8,964           923              371        1,294              7.6%              2.1%
16        256       6,892           516         436         180            8,756           732              422        1,154             17.6%              4.4%
8         512       7,060           516         560         184            8,992           671              590        1,261              9.9%              1.8%

The first row is the current configuration of using power-of-2 bin sizes in this range (1024, 2048, 4096).
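For readers unfamiliar with slop, here is a minimal sketch of the rounding being compared; the function names are ours and the logic is illustrative, not mozjemalloc's actual implementation:

#include <cstddef>
#include <cstdio>

// Old scheme: power-of-two classes between 512B and 4KiB.
static size_t RoundPow2(size_t aSize) {
  size_t cls = 512;
  while (cls < aSize) {
    cls <<= 1;
  }
  return cls;
}

// Proposed scheme: a class every 256 bytes in the same range.
static size_t Round256(size_t aSize) {
  return (aSize + 255) & ~size_t(255);
}

int main() {
  // An 1100-byte request rounds to 2048 today (948 bytes of slop) but
  // would round to 1280 with 256-byte spacing (180 bytes of slop).
  size_t request = 1100;
  printf("pow2: %zu (slop %zu)\n", RoundPow2(request),
         RoundPow2(request) - request);
  printf("256B: %zu (slop %zu)\n", Round256(request),
         Round256(request) - request);
  return 0;
}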

Are you taking into account the fact that for one allocation of each size, the minimum size actually used becomes the page size? Counterintuitively, this could have the exact opposite of the intended effect.

It might be better to see why we're allocating these odd sizes between 512B and 4K in the first place.

Summary: Add more size classes between 512B and 4KiB → Add more size classes between 512B and the page size

Also, this has the potential to increase fragmentation substantially (and thus RSS).

The fact that the "Allocated" numbers vary so much in comment 1 makes me doubt the entire table. It would also be necessary to see the full picture with more sites, after loading/unloading, multiple tabs, etc. An AWSY run would be a good start for that, but until those per-bin stats are exposed to about:memory, AWSY is not going to give that information... I'd suggest doing a local AWSY run with logging enabled and replaying those logs.

(In reply to Mike Hommey [:glandium] from comment #2)

Are you taking into account the fact that for one allocation of each size, the minimum size actually used becomes the page size? Counter intuitively, this could have the complete reverse effect than expected.

Right, for a process with very few live allocations this would make things worse. We need to balance slop with bin-unused, which is why I wanted to measure both.
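A worked example of the failure mode glandium describes, with illustrative numbers (a sketch, not mozjemalloc's actual accounting):

#include <cstdio>

int main() {
  // Illustrative numbers only: one 768B and one 1024B live allocation.
  const int kRun = 4096;  // each bin commits at least one run (page)

  // Old scheme: both requests share the 1024-byte bin -> one run.
  int oldCommitted = kRun;      // 4096
  int oldSlop = 1024 - 768;     // 256

  // New scheme: separate 768B and 1024B bins -> two runs committed.
  int newCommitted = 2 * kRun;  // 8192
  int newSlop = 0;

  printf("old: committed %d, slop %d\n", oldCommitted, oldSlop);
  printf("new: committed %d, slop %d\n", newCommitted, newSlop);
  // Saving 256 bytes of slop costs a whole extra committed page here;
  // the extra classes only pay off when each bin is well-populated.
  return 0;
}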

It might be better to see why we're allocating these odd > 512 < 4K sizes.

I've had a look at slop in DMD and found one thing I've fixed and another that I'm investigating (Bug 1662345); the remaining ones are allocations that are dynamically sized anyway, like JS script code (bytecode, I think), rather than constant-sized allocations. However:

  • I should filter for this range when looking at the DMD output.
  • I should try a larger process, I've been "optimising for" small processes since that's the case we're interested in for Fission, but I can look at larger ones too.

(In reply to Mike Hommey [:glandium] from comment #3)

Also, this has the potential to increase fragmentation substantially (and thus RSS).

Good point. I'll test with longer-running processes that do a few navigations.

Apologies, I want to break up your paragraph to make my reply clearer.

(In reply to Mike Hommey [:glandium] from comment #4)

The fact that the "Allocated" numbers vary so much in comment 1 makes me doubt the entire table.

I generated this table by replaying the same log file of allocations.

Allocated can vary since it measures the full cell size, not the requested size. From the table in comment 0, the sum of the slop column is at least 327KiB, and that doesn't include the allocations that get rounded up to 4096, since I excluded that row. So I think Allocated can vary that much.

It would also be necessary to see the full picture with more sites, after loading/unloading, multiple tabs, etc. An AWSY run would be a good start for that, but until those per-bin stats are exposed to about:memory, AWSY is not going to give that information... I'd suggest doing a local AWSY run with logging enabled and replay those logs.

Yes, I haven't tested what happens for a longer-lived process. Running AWSY locally is a good idea. Thanks.

I ran AWSY in try here:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b37bbae0a0c0c017f4eceabee3b8e52132d8b04c&newProject=try&newRevision=01a730fb4d3d464566f1410effee400919778190&framework=4

My next step is to tidy up the patch so you can see it.

BTW, considering Apple Silicon Macs are going to have 16KB pages, you'll also want to measure how this goes with 16KB pages (you can fake that by setting the static page size to 16KB on whatever OS you're using).
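For anyone reproducing this, a hedged sketch of what a compile-time ("static") page size looks like; the identifiers are hypothetical stand-ins, not mozjemalloc's actual names:

#include <cstddef>

// Hypothetical constants for an allocator's static page size. Changing
// the shift from 12 to 14 makes every run, bin and chunk calculation
// behave as if the OS had 16KiB pages.
static const size_t kPageSize2Pow = 14;  // 12 for 4KiB, 14 for 16KiB
static const size_t kPageSize = size_t(1) << kPageSize2Pow;
static const size_t kPageMask = kPageSize - 1;

// Example derived value: round an allocation size up to whole pages.
static inline size_t PageCeil(size_t aSize) {
  return (aSize + kPageMask) & ~kPageMask;
}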

Blocks: 1640309
No longer blocks: memshrink-content
URL: 1656155
See Also: 1640309
Depends on: 1670188
See Also: → 1656155

My list of things to test with this change that haven't been tested/answered yet is:

  • Test on long-running processes:
      ◦ Try processes that do a new pageload for each navigation (eg, browsing a news site/reddit).
      ◦ Try single-page apps.
      ◦ JS games / something that does a lot of processing in JS? (I'm thinking of object churn, but it may be more small objects here.)
  • Test on large processes:
      ◦ facebook?
      ◦ google docs (many tabs for different documents?)
      ◦ Some kind of large document? A very detailed SVG file?
  • Test with 16KiB pages.

Any other ideas, or sites with usage patterns we've thought of in the past for these cases?

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(continuation)

I'm not sure. Long running processes (beyond AWSY) is something we've never tested very well.

Flags: needinfo?(continuation)

My patch had a limitation preventing 16KiB and other page sizes from working. With that fixed, the results look like this:

I recorded a process browsing wikipedia, following several links until the gzipped log file was 100MB, then captured a memory report and killed the browser. Here is the amount of committed memory for each configuration (I have the other results too, but committed is the fairest comparison):

                4KiB pages  16KiB pages
Without patch:   68,732KiB    83,312KiB
With patch:      67,004KiB    80,320KiB

The patch is a 3% win in this test with 4KiB pages and a 4% win with 16KiB pages.

Just switching from 4KiB pages to 16KiB pages is a memory regression of 21%.

With more testing across different sizes and lifetimes of processes, we have jemalloc committed memory for each process (MiB):

            Before  After   Improvement
Wikipedia    67.12   65.43    2.51%
Facebook    492.13  486.31    1.18%
example.com   8.85    8.57    3.18%
slack       113.66  110.94    2.40%
google      545.00  530.98    2.57%
parent      257.27  253.37    1.52%
prealloc      6.39    5.86    8.31%
socket        2.16    2.25   -4.15%
privileged   22.41   22.41    0.02%

Mean                          1.95%
Mean (excluding singletons)   3.36%
  • Wikipedia: Follow about 10 links to different articles, each new article causes a new pageload.
  • Facebook: Scroll the news feed, react to some posts, leave idle for some time, reload the news feed and scroll & react again.
  • example.com: Load the page
  • slack: Read "All Unread", clicking on some threads to read them.
  • google: Label and archive some e-mails and open 3 google docs, edit one of these docs.

The total amount of reduced memory (even though 'Facebook' was captured from a different browser session) is 1.91%. The average saving per process is 1.95%. If we exclude the "singleton" processes, those that the browser has only one of, like socket, main and privilegedabout, then the average is 3.36%.

I'm confident that this is a clear win for memory saving.

Flags: needinfo?(mh+mozilla)

Can you check actual RSS rather than committed?

Flags: needinfo?(pbone)
Fission Milestone: --- → M7

(In reply to Mike Hommey [:glandium] from comment #18)

Can you check actual RSS rather than committed?

Here is the same comparison using the RSS of the logalloc-replay tool, excluding logalloc-replay's own dynamic memory.

             Before (MiB)  After (MiB)  Delta (MiB)  Percent
Wikipedia        69.82        67.77        2.05       2.94%
Facebook        494.34       488.88        5.46       1.10%
example.com      11.52        11.28        0.24       2.10%
slack           117.18       113.46        3.72       3.17%
google          541.25       526.38       14.87       2.75%
parent          259.61       254.79        4.82       1.86%
prealloc          9.07         8.53        0.54       5.99%
socket            4.27         4.25        0.02       0.37%
privileged       25.11        25.08        0.03       0.11%

Mean                                                  2.26%
Mean (excluding singletons)                           3.01%
Flags: needinfo?(pbone)

I'm kind of surprised you're getting such large RSSes at all, considering (and I had forgotten) that logalloc-replay doesn't memset() the allocated memory (since bug 1423000), so allocated memory is never actually committed unless you enable zero or junk.
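The effect glandium describes is easy to demonstrate standalone. A minimal sketch, assuming a Linux-style overcommitting kernel (this is not logalloc-replay's code):

#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
  const size_t kSize = 64 * 1024 * 1024;
  char* p = static_cast<char*>(malloc(kSize));
  if (!p) {
    return 1;
  }
  // At this point RSS has barely moved: the kernel hands out virtual
  // address space and only commits pages on first write. A replay tool
  // that never touches its allocations therefore shows very little RSS.
  memset(p, 0xA5, kSize);  // first touch: RSS now grows by ~64MiB
  printf("%d\n", p[0]);
  free(p);
  return 0;
}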

Flags: needinfo?(pbone)
Severity: -- → N/A
Fission Milestone: M7 → MVP
Whiteboard: [MemShrink] → [MemShrink] fission-memory

We need Bug 1671114 to measure the benefit of this change.

Depends on: 1671114
Attachment #9181196 - Attachment is obsolete: true
Attachment #9180104 - Attachment description: Bug 1669392 - pt 1. Add more jemalloc size classes r=glandium → Bug 1669392 - Add more jemalloc size classes r=glandium
Depends on: 1713271

Here's the updated RSS data:

           RSS before (KB)  RSS after (KB)  Delta (KB)  Percent
example          8,508           8,220          288       3.39%
extension       15,396          15,524         -128      -0.83%
Fb-parent      231,156         220,344       10,812       4.68%
Fb             503,512         497,400        6,112       1.21%
google         557,944         543,100       14,844       2.66%
parent         263,788         259,032        4,756       1.80%
prealloc         5,964           5,428          536       8.99%
slack          116,624         112,908        3,716       3.19%
socket           1,624           1,544           80       4.93%
wiki            68,244          66,004        2,240       3.28%

This is almost always a benefit, sometimes saving up to 5% of a process's memory usage.

Flags: needinfo?(pbone)

The next thing to test is a long browser session (eg 4 hours) to see if this negatively impacts fragmentation drastically.

I have tested a longer browser session (typical evening Firefox usage for me) to check that there's no regression with these changes due to fragmentation, and none of the processes showed any regression in RSS or bin-unused+slop (not shown).

Here are the results from memory replay at the end of the session for each of the processes (RSS in KB; "m-c" is an unpatched mozilla-central build):

          RSS (m-c)   RSS (patches)  Delta   Percent
parent    1,589,548     1,574,680    14,868   0.94%
discord     140,788       137,112     3,676   2.61%
facebook    103,804       102,980       824   0.79%
github       75,960        69,660     6,300   8.29%
todoist      63,336        61,832     1,504   2.37%
twitter     151,728       146,480     5,248   3.46%
youtube     377,444       374,280     3,164   0.84%
total     2,502,608     2,467,024    35,584   1.42%

After pressing "Minimise memory usage" at the end of the session, extra usage due to fragmentation would have shown up here if anywhere:

          RSS (m-c)   RSS (patches)  Delta   Percent
parent    1,015,704       999,272    16,432   1.62%
discord     113,580       111,100     2,480   2.18%
facebook     94,564        93,784       780   0.82%
github       46,604        45,196     1,408   3.02%
twitter     143,912       139,444     4,468   3.10%
youtube     367,224       363,912     3,312   0.90%

I ran some AWSY tests:

https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=9122dd221e6801f1db419147af8d981b71829b31&newProject=try&newRevision=68ab6bb4f439f4751497ba695c6a1514d2d65716&framework=4

Explicit base content memory has increased, but that's the trade-off these patches make: they increase fragmentation in order to decrease slop, while reducing resident memory overall. We can verify that this is fragmentation because the memory reports for base explicit memory show an increase in bin-unused.

Some of the tests show a regression for resident memory, such as on Windows, but viewing their subtests (https://treeherder.mozilla.org/perfherder/comparesubtest?originalProject=try&newProject=try&newRevision=68ab6bb4f439f4751497ba695c6a1514d2d65716&originalSignature=2240017&newSignature=2240017&framework=4&originalRevision=9122dd221e6801f1db419147af8d981b71829b31) shows that the "tabs closed" subtest is bringing the score down. Although that's a big difference, after the forced GC the regression disappears and becomes a win again. In most of the other tests the regression disappears without the forced GC (only needing the 30 seconds). This may mean that the browser doesn't return memory quickly after closing tabs, but on the whole it uses less memory. This is also a symptom of fragmentation, since a single allocation can keep a chunk allocated.

The case I wanted to optimise is a content process running example.com, but AWSY doesn't test this. I tested it above with logalloc-replay and it showed a 288KB improvement, but when I test by comparing memory reports it's 100KB worse. That's not the result I'd hoped for; still, there's a lot of improvement for larger content processes.

Any further thoughts/reviews on this, glandium?

Thanks.

Flags: needinfo?(mh+mozilla)

This bug is a soft blocker for Fission MVP. We'd like to fix it before our Release channel rollout, but we won't delay the rollout waiting for it.

Whiteboard: [MemShrink] fission-memory → [MemShrink] fission-memory fission-soft-blocker
Flags: needinfo?(mh+mozilla)

Revert the SubPage size class to its original power-of-two sizing and make
the 512B-4KiB range a second Quantum-spaced size class.

All the ranges defined for the size classes are now inclusive of their upper
bound, to make them consistent.

Depends on D92729
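To make the resulting layout concrete, here is a rough sketch of the class ranges the commit message describes, assuming 16KiB pages; the enum and thresholds approximate mozjemalloc's configuration rather than quoting it:

#include <cstddef>

enum class ClassKind { Tiny, Quantum, QuantumWide, SubPage, Large };

// Upper bounds are inclusive, matching the commit message.
ClassKind ClassifySize(size_t aSize, size_t aPageSize) {
  if (aSize <= 8) return ClassKind::Tiny;             // 2, 4, 8
  if (aSize <= 512) return ClassKind::Quantum;        // 16-byte spacing
  if (aSize <= 4096) return ClassKind::QuantumWide;   // 256-byte spacing (this bug)
  if (aSize <= aPageSize) return ClassKind::SubPage;  // powers of two again
  return ClassKind::Large;
}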

Attachment #9240222 - Attachment is obsolete: true

Setting status-firefox94=wontfix. Since the Nightly 94 code freeze is this week, Paul plans to wait until Nightly 95 to land these malloc changes.

Fission Milestone: MVP → Future
Priority: P1 → P2
Whiteboard: [MemShrink] fission-memory fission-soft-blocker → [MemShrink] fission-memory [fission:m95]

jemalloc_stats takes an array as its second argument. It expects this
array to have enough space for all the bins; previously the maximum was set
as a magic number. To make it dependent on the configured bins, this patch
replaces the compile-time constant with a function.

Depends on D92729
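A sketch of the calling pattern this enables; jemalloc_stats_num_bins() is the function the patch adds, and the surrounding signatures are approximated from memory, so treat this as illustrative rather than exact:

#include <vector>

#include "mozmemory.h"  // jemalloc_stats_t, jemalloc_bin_stats_t

void CollectStats() {
  jemalloc_stats_t stats;
  // Size the per-bin array from a runtime query instead of a
  // compile-time magic maximum.
  std::vector<jemalloc_bin_stats_t> bins(jemalloc_stats_num_bins());
  jemalloc_stats_internal(&stats, bins.data());
  // ... consume stats and bins ...
}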

Blocks: 1735250

Comment on attachment 9244690 [details]
Bug 1669392 - Provide a less-magic array size for jemalloc_stats r=glandium

Revision D127761 was moved to bug 1735250. Setting attachment 9244690 [details] to obsolete.

Attachment #9244690 - Attachment is obsolete: true
Pushed by pbone@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9de9bd47c061
Add more jemalloc size classes r=glandium

Backed out changeset 9de9bd47c061 (Bug 1669392) for causing build bustages.

Flags: needinfo?(pbone)

Oh, I need to move some code between patches.

Flags: needinfo?(pbone)
Pushed by pbone@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8679a50bd45a
Add more jemalloc size classes r=glandium
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 95 Branch
Blocks: 1735715

== Change summary for alert #31892 (as of Fri, 15 Oct 2021 09:51:13 GMT) ==

Improvements:

Ratio  Test                                               Platform                    Options                       Absolute values (old vs new)
14%    perf_reftest_singletons link-style-cache-1.html    macosx1014-64-shippable-qr  e10s fission stylo webrender  1,026.34 -> 879.62
13%    perf_reftest_singletons link-style-cache-1.html    macosx1014-64-shippable-qr  e10s stylo webrender          1,020.05 -> 886.26
8%     perf_reftest_singletons link-style-cache-1.html    linux1804-64-shippable-qr   e10s fission stylo webrender  472.13 -> 434.69
7%     perf_reftest_singletons inline-style-cache-1.html  macosx1014-64-shippable-qr  e10s stylo webrender          1,721.18 -> 1,599.22
7%     perf_reftest_singletons inline-style-cache-1.html  macosx1014-64-shippable-qr  e10s fission stylo webrender  1,714.59 -> 1,601.56
6%     perf_reftest_singletons link-style-cache-1.html    linux1804-64-shippable-qr   e10s fission stylo webrender  475.06 -> 446.11

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=31892

Blocks: 1738240
Regressions: 1735482
