Open Bug 1805644 Opened 1 year ago Updated 10 months ago

Speedometer 2 is ~5% faster with --disable-jemalloc

Categories

(Core :: Performance Engineering, task, P1)


ASSIGNED

People

(Reporter: jandem, Assigned: pbone)

References

(Depends on 4 open bugs, Blocks 1 open bug)

Details

(Whiteboard: [sp3-p1] [sp3-preact-todomvc] [sp3-react-todomvc] [sp3-vanillajs-todomvc] [sp3-vuejs-todomvc] [sp3-jquery-todomvc])

See the perf comparison here for a --disable-jemalloc build:

https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=0c1382d65cf4765c31ee57a1d7fb34582f20922a&newProject=try&newRevision=1df24dd07d99a784ca22a13463a8ffe136cce728&framework=13&page=1

Some of the subtests are > 15% faster without jemalloc.

It's hard to say how much of this is from the system allocator being faster vs extra overhead we have for security features such as poisoning, but it suggests potential wins in this area.

Blocks: speedometer3

Marking this as P1 because I think it's critical that we get high confidence on why we see this effect, and whether we should invest in doing something about it, earlier rather than later. I think it's entirely plausible that our answer will be "this makes sense because X and so there's nothing we can do here," but this is a lot of opportunity to leave lying around.

Being concrete, what I would like to see answers to:

  • What distinguishes the tests that see very large wins here from the tests that don't? (NOTE: I would start profiling tests roughly in order from highest confidence to lowest in the perfherder view, in order to hopefully see stable profiles which can easily be compared)
  • Can the difference in performance be explained by time directly attributed within malloc/free/etc, or could effects like bug 1805255 be playing a part?
  • For posterity, what's the whole list of extra variables that are bundled with us using jemalloc? (Poisoning, PHC, and profiler memory instrumentation all come to mind, but what's the whole list?)
    • Can we isolate the impact of those things?
  • What's the list of jemalloc-specific tuning that we've done outside of jemalloc, and what is the impact of all of those things? (DOMArena and DOM's work to try to size things in-line with jemalloc implementation details, uses of jemalloc_thread_local_arena, etc. - what's the whole list?)
    • Can we isolate the impact of those things?

If nothing else, answering these questions explicitly and in writing will give us something specific to point to detailing why we have not invested in migrating off of it.

Priority: -- → P1

GCP and I have been discussing the idea of testing various allocators to determine if jemalloc is still the best allocator for us. We are making these assumptions:

  • mozjemalloc has been tuned to work best for Firefox since it was forked from jemalloc3.
  • Firefox has been tuned to work best for mozjemalloc. Possibly in subtle unconscious ways just because it's what we test with.
  • Since we forked mozjemalloc, research on other allocators has continued, so the state of the art has changed since we last evaluated this.

The last point is why we think it's worth evaluating other allocators now, but we were unsure how strong the benefit might be. IMO this bug says that it is worth prioritising this.

One of our outreachy applicants expressed a strong interest in evaluating various allocators. I would like to work with her to investigate this. I will talk with her before I assign the bug.

Depends on: 1806054

Hi Doug, excellent list of things to look at.

(In reply to Doug Thayer [:dthayer] (he/him) from comment #1)

Being concrete, what I would like to see answers to:

  • What distinguishes the tests that see very large wins here from the tests that don't? (NOTE: I would start profiling tests roughly in order from highest confidence to lowest in the perfherder view, in order to hopefully see stable profiles which can easily be compared)

The top few subtests are about deleting or completing TODO items, which makes me feel like free() could be a problem. This becomes more pronounced when sorting by effect size rather than confidence. Selecting two profiles and comparing the "busy time" of the whole test (more samples matter more than looking at any one subtest), there's a 10x difference in the percentage of the time spent in free().

I landed a patch to improve free() this week. I'll run another test with that in place too. But I don't feel like it'll make much of a difference for this benchmark.

  • Can the difference in performance be explained by time directly attributed within malloc/free/etc, or could effects like bug 1805255 be playing a part?

So far it looks like it can; I intend to confirm that with the logalloc-replay tool.

  • For posterity, what's the whole list of extra variables that are bundled with us using jemalloc? (Poisoning, PHC, and profiler memory instrumentation all come to mind, but what's the whole list?)

I don't know the complete list, but I can add that the replace_* code also adds a little function-call overhead.
Regarding the profiler overhead, I think it could be reduced: Bug 1806054.

  • Can we isolate the impact of those things?

I'm sure we could. If there's no mechanism already to remove them, then one should be added. We already disable poisoning in release builds, and I assumed PHC would be agnostic, but it seems it isn't: it looks like --disable-jemalloc also removes the malloc replacement code needed to provide those hooks.

  • What's the list of jemalloc-specific tuning that we've done outside of jemalloc, and what is the impact of all of those things? (DOMArena and DOM's work to try to size things in-line with jemalloc implementation details, uses of jemalloc_thread_local_arena, etc. - what's the whole list?)
    • Can we isolate the impact of those things?

People probably also size structures to fit nicely into jemalloc's cells, though I modified jemalloc recently to make this more flexible, and I doubt it would have a consistent impact. We may be able to isolate arenas with a patch.

The volunteer contributor hasn't responded to messages for a couple of weeks, so I've made a start on this.

Assignee: nobody → pbone
Status: NEW → ASSIGNED

I set up a test with logalloc-replay so we can compare different allocators with exactly the same allocation pattern and without other effects from Firefox. But it is limited to a single-threaded workload and has no cache effects from the rest of the browser - in other words, it's pure (de)allocation throughput.

jemalloc: 8.574 s ± 0.122 s
glibc's builtin malloc (ptmalloc): 6.991 s ± 0.032 s

That's how far I got and I'll be on leave for the next week. So mainly as a reminder to myself: When I return I want to re-test this checking the status of poisoning and then start profiling logalloc-replay to learn more about the difference.

Flags: needinfo?(pbone)

Speedometer results on my PC:

                             Speedometer    logalloc-replay
jemalloc with poisoning      81.2 ± 1.6     8.607 s ± 0.057
jemalloc without poisoning   83.8 ± 1.1     6.249 s ± 0.038
ptmalloc                     88.9 ± 1.8     7.026 s ± 0.036

So that's weird. I tested in logalloc-replay first, and when I got that huge improvement from disabling poisoning I figured that was the entire story, since jemalloc without poisoning is faster than ptmalloc there. But checking the results with Speedometer didn't match: ptmalloc is still much faster. My logalloc-replay test data is a recording of a previous Speedometer run, so I would expect to see a similar trend - just exaggerated in logalloc-replay because it doesn't run the rest of the browser. The next direction to look at is probably cache and paging effects, since without the browser's memory accesses logalloc-replay doesn't really test those either, nor behaviour across cores.

Flags: needinfo?(pbone)

A very common way that jemalloc issues present in performance profiles is as locks stalling the main thread while some off-thread work is happening. I think the cross-thread performance behaviour is probably the most interesting thing for allocation performance here. The lineage of jemalloc that we are using is really a single-threaded allocator with locks thrown in to make it technically correct; it doesn't seem to have been designed for multi-threaded applications.

Yeah when we were doing stylo we ended up using thread-local arenas to avoid lock contention. Bug 1361258 and bug 1291355 comment 35 have various discussions about this. It might be that more subsystems need this treatment, or we might want to enable that by default on other threads as well?

Depends on: 1808429

(In reply to Ted Campbell [:tcampbell] from comment #7)

A very common way that jemalloc issues present in performance profiles is as locks stalling the main thread while some off-thread work is happening. I think the cross-thread performance behaviour is probably the most interesting thing for allocation performance here. The lineage of jemalloc that we are using is really a single-threaded allocator with locks thrown in to make it technically correct; it doesn't seem to have been designed for multi-threaded applications.

Lock contention showed up in a big way in the firefox profiler profiles. I agree, that's most likely the big difference. I'll pull at that thread (pun :P)

Fabrice asked me if I could also test memory usage:

Memory usage in the above test, captured from the memory reporter on the content process after running the speedometer benchmark:
Also, I rebased to central since comment 6, which I don't expect to have an effect.

jemalloc with poison:
201.64 MB ── resident
478.88 MB ── resident-peak
79.54 MB ── resident-unique

jemalloc no poison:
182.11 MB ── resident
376.01 MB ── resident-peak
59.07 MB ── resident-unique

Hrm, a bit of a difference. There must be some memory we never touch, like buffers never filled or hash tables never populated - at least in those pages.

ptmalloc:
504.94 MB ── resident
551.50 MB ── resident-peak
394.34 MB ── resident-unique

For completeness, here's ptmalloc. That could be (one reason) why we prefer jemalloc ;-)

Whiteboard: sp3:p1

What I've learnt in the last few days.

The most contended lock in all of jemalloc is the arena lock.

The most contended arena is the JS Malloc arena. On average a thread waits 250-350 ns to take the lock; for other arenas it's 30-60 ns. Summing up all the times the lock is taken over a Speedometer 2 run comes to about 2-3 seconds. I'm assuming that data moving between CPU caches causes other cache misses and slowness that I'm not measuring; in other words, avoiding contention perfectly (like a spherical cow in a vacuum) would win more than 2-3 seconds of performance.

Other than the main thread, the JS engine helper thread that uses this arena the most is the Ion compilation thread.

What I've tried (speedometer benchmark results):

  • Baseline performance: 80.4 ± 1.6
  • Give every JS helper + Firefox thread its own arena: 81.4 ± 1.6 (could be using more memory, but I didn't measure that)
  • Every Ion task opts in to a shared Ion arena: 78.3 ± 1.4
  • Every Ion task opts in to a thread-local arena; once opted in, threads can't opt out, so other tasks using that helper thread end up using the new arena: 78.9 ± 1.9

These benchmark results are very close together, and the one thing I feel confident saying is that there's no magic lever to pull to bring us up to ~89 (ptmalloc). But I do have thoughts about things we could try to change in jemalloc, and I'd still like to test other allocators.

See Also: → 1809058

Don't mean to derail this, but just throwing this in the mix: have we taken a hard look at Chrome's PartitionAlloc yet?

(In reply to Doug Thayer [:dthayer] (he/him) from comment #12)

Don't mean to derail this, but just throwing this in the mix: have we taken a hard look at Chrome's PartitionAlloc yet?

No, that's not derailing it at all as I intend to survey other contemporary allocators, especially those used in browsers.

I've got:

jemalloc3
mozjemalloc
jemalloc4
jemalloc5
ptmalloc2
tcmalloc
bmalloc
partitionalloc
mimalloc
mozmalloc
nedmalloc
ltalloc
TLSF
libpas
smalloc

I haven't done anything to strike any off the list yet. If there's anything else not on my list that should be I'd love to know about it.

Depends on: 1809610
Performance Impact: --- → none
Performance Impact: none → ---
Component: Performance → Performance Engineering

We weren't sure about the component here, feel free to move it elsewhere.

Depends on: 1810953
Depends on: 1810954
See Also: → 1811985
See Also: → 1811987
Whiteboard: sp3:p1 → [sp3:p1], [sp3:preact-todomvc], [sp3:react-todomvc], [sp3:vanillajs-todomvc], [sp3:vuejs-todomvc], [sp3:jquery-todomvc]

In Bug 1809610 I have a stack of patches that improve jemalloc's performance for free(). However some page loads are slower. gcp suggested testing jemalloc vs --disable-jemalloc with this wider set of tests to see what happens to page load performance with --disable-jemalloc. Mostly it's faster, which is consistent with the other tests above.

https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=3240143fd8c0db90e502ba9ede37b54a21ec544d&newProject=try&newRevision=a9386af992509a321d863ba5c05ba7b9f1a631de&framework=13&page=1

Just FYI, I was curious whether we're using mozjemalloc in a reasonable way. Bug 1258257 decreased the page cache size quite a bit, and later, in bug 1397101, arenas got an even smaller default, just 32 pages. So I tried out massive caches: https://hg.mozilla.org/try/rev/6b07332f2f314d58d2f65650a9df02bd3dbfe656. That patch also includes changes that try to ensure more DOM nodes can be deleted sooner, without waiting for the CC to run.
https://treeherder.mozilla.org/perfherder/comparesubtest?originalProject=try&newProject=try&newRevision=aa2de136615b48e4b0e995a5cd36ac7a25f1e28c&originalSignature=4586009&newSignature=4586009&framework=13&originalRevision=a8f477ea67a3da045869bb63d5b89630c8c8e93e&page=1

Given those results, it might be possible to also tweak our mozjemalloc usage. It is rather silly that we use the same value for mMaxDirty no matter what kind of system FF is running on, and that we aren't updating mMaxDirty dynamically depending on what kind of load the process has.

Whiteboard: [sp3:p1], [sp3:preact-todomvc], [sp3:react-todomvc], [sp3:vanillajs-todomvc], [sp3:vuejs-todomvc], [sp3:jquery-todomvc] → [sp3-p1] [sp3-preact-todomvc] [sp3-react-todomvc] [sp3-vanillajs-todomvc] [sp3-vuejs-todomvc] [sp3-jquery-todomvc]

I have started some benchmarks to retest if --disable-jemalloc is still faster after our recent changes.

https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=86a72388debf78af0d7cf1cd33a148e85cd953cf&newProject=try&newRevision=9f6a7a2d2d665d4333ec783042e30ede090c42df&framework=1&page=1

They're running now, but here's the URL.

Flags: needinfo?(smaug)

Seems like it possibly is:
https://treeherder.mozilla.org/perfherder/comparesubtest?originalProject=try&newProject=try&newRevision=9f6a7a2d2d665d4333ec783042e30ede090c42df&originalSignature=4586009&newSignature=4586009&framework=13&application=firefox&originalRevision=86a72388debf78af0d7cf1cd33a148e85cd953cf&page=1

--disable-jemalloc does use quite a bit more memory, AFAIK, so using that option is not really realistic.

But very useful test run. Looks like DOM-heavy subtests especially benefit from more memory. Perhaps we should try increasing the size of DOMArena even more, and/or the size of the generic arena. But we need to figure out how to avoid AWSY regressions.

Flags: needinfo?(smaug)