811 bytes, patch
|Details | Diff | Splinter Review|
847 bytes, patch
|Details | Diff | Splinter Review|
200.99 KB, image/png
16.12 KB, text/plain
1.17 KB, patch
|Details | Diff | Splinter Review|
Tests conducted in bug 580408 show that jemalloc 3 regresses our memory usage after tab close.
Is this still the case with the newest jemalloc version?
(In reply to Florian Bender from comment #1) > Is this still the case with the newest jemalloc version? Recent measuring shows it's still the case.
It looks like disabling the thread cache (tcache) helps: https://areweslimyet.com/?series=bbug762448&evenspacing . The builds in that run are, from left to right, jemalloc3 with tcache disabled , jemalloc3 with tcache enabled , and mozjemalloc . Due to an oversight on my part, the jemalloc3 builds also include my heap partitioning patches. We know from previous experimentation that the overhead introduced by these patches over a plain jemalloc3 build is rather small (and certainly smaller than the jemalloc3 vs. mozjemalloc regression), but bear in mind that it is present in the graph too. I'll be happy to re-run the tests without my patches if necessary, though, to get a more accurate view of jemalloc3's isolated effects. 1- https://tbpl.mozilla.org/?tree=Try&rev=fd6445a170ae 2- https://tbpl.mozilla.org/?tree=Try&rev=b161078673c0 3- https://tbpl.mozilla.org/?tree=Try&rev=caf0405b7e99
Mmmm I thought I had disabled tcache, but memory/build/jemalloc_config.c tells I'm wrong.
That said, I *do* think I tested with tcache disabled back then, and while it was making a difference, it didn't fix the entire RSS regression. It looks like it does, now.
(In reply to Mike Hommey [:glandium] from comment #5) > It looks like it does, now. Actually, I misread it, there's a small remainder of a regression.
Here's the patch for disabling the tcache. Try: https://tbpl.mozilla.org/?tree=Try&rev=b4fe96ca4634 . I'll look at the oranges there today. I also added the Try build to the AWSY run in comment 3 as the fourth column in the graph, so we can see what the RSS looks like without the partitioning patches. Looks like we regress by very little when not partitioning, are we willing to tolerate that for the purposes of this bug?
Attachment #8510171 - Flags: review?(mh+mozilla)
Attachment #8510171 - Flags: review?(mh+mozilla) → review+
(In reply to Guilherme Gonçalves [:ggp] from comment #7) > I also added the Try build to the AWSY run in comment 3 as the fourth column > in the graph, > so we can see what the RSS looks like without the partitioning patches. > Looks like we regress > by very little when not partitioning, are we willing to tolerate that for > the purposes of this > bug? I'm going to land bug 1083686 once try confirms it doesn't end up burning the tree. Would you mind getting a log from an AWSY run, and fool around with logalloc-replay with mozjemalloc and jemalloc3 for comparison? That would eliminate some of the noise that would be due to AWSY runs not doing the exact same allocations at the same exact time, and would give us a clearer picture of what the regression actually looks like.
(Note that the jemalloc_stats output /might/ be insufficient. Ping me if you need more)
(In reply to Mike Hommey [:glandium] from comment #8) > Would you mind getting a log from an AWSY run, and fool around > with logalloc-replay with mozjemalloc and jemalloc3 for comparison? Sure. I tried it locally and it does seem to run well enough, I can get :johns to help run it on AWSY tomorrow. A couple of questions, though: - While preprocessing the allocation log locally, I got several ignored calls to free ("Did not see an alloc for pointer"). Is that expected? - Looks like there's no way to replay allocations made by different threads, right? So this will be a good way to compare mozjemalloc and jemalloc3 with tcache disabled, but not enough to assess the regression itself;
(In reply to Guilherme Gonçalves [:ggp] from comment #10) > (In reply to Mike Hommey [:glandium] from comment #8) > > Would you mind getting a log from an AWSY run, and fool around > > with logalloc-replay with mozjemalloc and jemalloc3 for comparison? > > Sure. I tried it locally and it does seem to run well enough, I can get > :johns to help run it on AWSY tomorrow. A couple of questions, though: > > - While preprocessing the allocation log locally, I got several ignored > calls to free ("Did not see an alloc for pointer"). Is that expected? On mac, /maybe/, but it sure could be worth investigating them. > - Looks like there's no way to replay allocations made by different threads, > right? So this will be a good way to compare mozjemalloc and jemalloc3 with > tcache disabled, but not enough to assess the regression itself; Well, the tcache regression is kind of obvious, since tcache adds per-thread data. But AWSY seems to show /some/ level of regression without tcache, and that's what we should be tracking here, and replaying the same allocations would allow to eliminate variations due to the change in order of allocations. Or due to some other things.
:johns has provided us with logs of a sample AWSY run. I'm hosting the log  and its munged version  for convenience. While running this locally, I came across crashes in both mozjemalloc and jemalloc3. The crash in mozjemalloc was being caused by an assertion related to memory corruption . It looked like Replay::Commit was overwriting jemalloc's metadata; the patch at bug 1093174 seems to solve that (though I'm not quite sure why). I do manage to replay the run successfully with that patch applied, I'll have experimental results shortly. 1- https://people.mozilla.org/~ggoncalves/malloc-log-ggp.log.xz 2- https://people.mozilla.org/~ggoncalves/malloc-log-ggp.munged.xz 3- http://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c?from=memory/mozjemalloc/jemalloc.c#4416
Depends on: 1093174
Here are the results of replaying the AWSY log in comment 12 over mozjemalloc and jemalloc3 with tcache disabled. The run contains 64 calls to jemalloc_stats(), and while it's not clear to me how they correspond to the 8 data points AWSY collects for each run, they should be enough for comparison of overall memory usage. The image shows 5 pairs of graphs, three in the top row, two in the bottom row, each containing the 64 measurements for a different metric (labeled in the y axis). In each pair, the top graph shows the numbers in MiB, and the bottom shows the difference in values between mozjemalloc and jemalloc3. Out of the stats returned by jemalloc_stats, I've omitted 'bookkeeping' and 'bin_unused', as they're not implemented in jemalloc3. On the other hand, I measured RSS in each jemalloc_stats() call using the task_info OSX syscall [1,2], so we have figures for that too. While the stats for 'allocated', 'mapped' and 'rss' show that the remaining regression is rather small, I don't know how to explain the big difference in 'dirty' and 'waste', nor why 'waste' is always zero in mozjemalloc. 1- http://nadeausoftware.com/articles/2012/07/c_c_tip_how_get_process_resident_set_size_physical_memory_use 2- https://stackoverflow.com/questions/5839626/how-is-top-able-to-see-memory-usage
And here is the raw output of logalloc-replay for both allocators, concatenated in a single file.
Mike, what do you think? Can we live with the current situation, or is the RSS difference peaking at 3MiB still too bad?
Assignee: nobody → ggoncalves
3MiB is tolerable, I guess. Did you check with jemalloc trunk?
(note we should investigate the weird values on e.g. waste, too)
Not yet, I'll re-test when we settle on a revision to update to, after upstreaming bug 899126. But let's land the tcache patch for now. https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=f483e65fe480
Summary: Investigate jemalloc 3 RSS regression → Investigate jemalloc 3 RSS, dirty and waste regression
Looks like the regression in waste can be explained by the fact that jemalloc3's "stats.active" includes what we report as bin_unused. So in jemalloc3 we compute waste as: stats->waste = mallctl("stats.active") - mallctl("stats.allocated") whereas in mozjemalloc, it is: for each arena: stats->waste += arena_committed - arena_allocated - arena_dirty - arena_unused - arena_headers; (... compute total space occupied by chunk headers as chunk_header_size ...) stats->waste -= chunk_header_size; jemalloc3's "stats.active" is equivalent to the sum over all arenas of arena_committed + arena_unused, and already discounts chunk_header_size, arena_headers and arena_dirty. Since the sum of arena_unused over all arenas is bin_unused, subtracting bin_unused from stats->waste should fix the regression. I've tested this by updating to latest jemalloc3, then applying the patch in bug 899126 to get bin_unused working, then applying this patch. As bug 899126 comment 5 already shows, waste is equal to binunused on a simple logalloc run, so waste is zero. I have confirmed that this is also the case with our logged AWSY run, which is the same we get with mozjemalloc.
And for reference, here's where I believe bin_unused gets added to "stats.active": https://github.com/jemalloc/jemalloc/blob/dev/src/arena.c#L247 . arena_run_split_remove gets called whenever a new run is allocated for an arena, and at that point need_pages is the total number of pages needed to accommodate the entire run, including unused regions.
Comment on attachment 8521655 [details] [diff] [review] Deduct bin_unused from waste in jemalloc3's stats. Actually, sorry, the patch obviously depends on bug 899126 landing first. There's no point in reviewing it now.
And the regression in dirty can be resolved by tweaking the "lg_dirty_mult" jemalloc3 option, which specifies the base 2 log of the acceptable ratio of active to dirty pages per arena. The default value is 3, meaning the allocator will attempt to keep a ratio of 8:1 active to dirty pages. Given how: 1- We only use one arena, for both jemalloc3 and mozjemalloc; 2- For the entire recorded AWSY run, the value of allocated, and also of jemalloc3's "stats.active", never exceeds 250MiB (and there's actually some slack there); 3- We're currently willing to tolerate at most 4MiB of dirty pages with mozjemalloc, as per the DIRTY_MAX_DEFAULT macro; I think we can use a conservative value of ceil(log2(250/4)) = 6. I'll upload a patch to do that shortly. I'm attaching the graphs for mozjemalloc vs. jemalloc3, after updating to latest jemalloc3 and applying the patches for waste and dirty. We're now within a 2MiB margin of mozjemalloc, and rss has dropped a little further now that we're being more aggressive about purging pages.
Attachment #8516891 - Attachment is obsolete: true
The raw data used to generate the graphs.
Attachment #8516892 - Attachment is obsolete: true
And the patch.
(In reply to Guilherme Gonçalves [:ggp] from comment #24) > 3- We're currently willing to tolerate at most 4MiB of dirty pages with > mozjemalloc, as per the DIRTY_MAX_DEFAULT macro; > > I think we can use a conservative value of ceil(log2(250/4)) = 6. Important detail: in bug 811740 we limit to *1* MiB on B2G, so this value would need to be different for that platform. http://hg.mozilla.org/mozilla-central/annotate/e7f3e6fee15e/memory/mozjemalloc/jemalloc.c#l155
Good point, let's go with 8 on B2G.
Attachment #8532679 - Flags: review?(mh+mozilla)
Comment on attachment 8532679 [details] [diff] [review] Bump opt.lg_dirty_mult in jemalloc3 to reduce number of dirty pages. Review of attachment 8532679 [details] [diff] [review]: ----------------------------------------------------------------- ::: memory/build/jemalloc_config.c @@ +17,5 @@ > +#endif > + > +#define MOZ_MALLOC_OPTIONS \ > + "narenas:1,lg_chunk:20,tcache:false" MOZ_MALLOC_PLATFORM_OPTIONS > +MFBT_DATA const char * je_(malloc_conf) = MOZ_MALLOC_OPTIONS; Make this je_(malloc_conf) = MOZ_MALLOC_OPTIONS MOZ_MALLOC_PLATFORM_OPTIONS; instead
Attachment #8532679 - Flags: review?(mh+mozilla) → feedback+
Attachment #8521655 - Flags: review?(mh+mozilla) → review+
Comment on attachment 8539391 [details] [diff] [review] Bump opt.lg_dirty_mult in jemalloc3 to reduce number of dirty pages. Review of attachment 8539391 [details] [diff] [review]: ----------------------------------------------------------------- ::: memory/build/jemalloc_config.c @@ +10,5 @@ > #include "mozilla/Types.h" > > /* Override some jemalloc defaults */ > +#ifdef MOZ_B2G > +#define MOZ_MALLOC_PLATFORM_OPTIONS ",lg_dirty_mult:8" Maybe add a comment here why B2G has a different value. Sorry for the ping-pong.
Attachment #8539391 - Flags: review?(mh+mozilla) → review+
Done. Carrying over r+.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla37
You need to log in before you can comment on or make changes to this bug.