Bug 736074 (Open): Opened 12 years ago, Updated 2 years ago

The madvise(MADV_FREE) call we make in arena_run_dalloc can take between 20% and 30% of our time

Categories

(Core :: Memory Allocator, defect)

Platform

(x86, macOS)

People

(Reporter: ehsan.akhgari, Unassigned)

Details

(Whiteboard: [Snappy:P2])

Attachments

(2 files)

I keep seeing this in my profiles.  See the attached screenshot for a sample where we spent ~16s of wall-clock time doing this.  We need to stop calling madvise or find a more efficient way to do this.
Steps to reproduce would be nice...
(In reply to Justin Lebar [:jlebar] from comment #1)
> Steps to reproduce would be nice...

Open at least three gmail tabs and use them all, and use the browser for a day or so for regular browsing...
(In reply to Ehsan Akhgari [:ehsan] from comment #2)
> Open at least three gmail tabs and use them all, and use the browser for a
> day or so for regular browsing...

When do you turn on the profiler?
(In reply to Justin Lebar [:jlebar] from comment #3)
> (In reply to Ehsan Akhgari [:ehsan] from comment #2)
> > Open at least three gmail tabs and use them all, and use the browser for a
> > day or so for regular browsing...
> 
> When do you turn on the profiler?

Usually after a day or so of browsing I notice the browser slowing down to a crawl, and I profile it at that time.  Basically, every time I see the UI starting to become unresponsive.  After my fix to bug 733277, this is now the #1 source of jankiness for me.
> After my fix to bug 733277, this is now the #1 source of jankiness for me.

Whoa, wait a second.  We just went from "this is hot in my profiler (mostly off main thread)" to "this is the #1 source of jank for me".  When we talked on IRC, you said you had only a "hunch" that this was causing jank?
This seems like an excellent candidate for telemetry
Whiteboard: [Snappy] → [Snappy][telemetry-needed]
(In reply to Taras Glek (:taras) from comment #6)
> This seems like an excellent candidate for telemetry

What are you going to measure, exactly?  The amount of time we spend in madvise()?  On/off main thread?  That still doesn't tell the whole story, since off-thread madvise can block on-thread madvise.

Anyway, I don't see what we'd do with this data.
> Anyway, I don't see what we'd do with this data.

By which I mean, free() should be as fast as we can make it.  If we can't make it faster, we should try to call it less often.  There's non-negligible overhead of calling gettimeofday twice for every free() / madvise(), plus we have to acquire telemetry's histogram locks.
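For concreteness, the per-call overhead being discussed would look roughly like this (a hypothetical instrumentation sketch, not actual jemalloc or telemetry code; MADV_FREE is the Mac flavor):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/time.h>

/* Hypothetical sketch: time a single madvise() call.  The two
 * gettimeofday() calls per madvise are the overhead in question. */
static long timed_madvise(void *addr, size_t len)
{
    struct timeval start, end;
    gettimeofday(&start, NULL);       /* extra syscall #1 */
    madvise(addr, len, MADV_FREE);    /* the call being measured */
    gettimeofday(&end, NULL);         /* extra syscall #2 */
    return (end.tv_sec - start.tv_sec) * 1000000L
         + (end.tv_usec - start.tv_usec);  /* elapsed microseconds */
}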
(In reply to Justin Lebar [:jlebar] from comment #8)
> plus we have to acquire telemetry's histogram locks.

There are no locks in telemetry for this sort of measurement (currently).
> There are no locks in telemetry for this sort of measurement (currently).

I was afraid of that.  :)  We'd certainly have to use some kind of locking (or atomic integer operations) here, though, since free() is obviously multi-threaded.  I mean, I guess we could use thread-local storage and then aggregate periodically...  But we're getting into complex territory here, and we still have the overhead of two gettimeofday calls.

And again, I don't understand what the data would be useful for.  (I say this as a full-throated advocate of telemetry, in general.  :)
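The atomic/thread-local aggregation being alluded to could look roughly like this (a sketch assuming C11 atomics are usable inside the allocator; the names are made up):

#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical lock-free accumulation: each timed madvise adds its
 * elapsed time to shared counters; a reporting thread reads them
 * periodically instead of taking a histogram lock on every free(). */
static _Atomic uint64_t madvise_usec_total;
static _Atomic uint64_t madvise_call_count;

static void record_madvise_time(uint64_t usec)
{
    atomic_fetch_add_explicit(&madvise_usec_total, usec, memory_order_relaxed);
    atomic_fetch_add_explicit(&madvise_call_count, 1, memory_order_relaxed);
}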
(In reply to Justin Lebar [:jlebar] from comment #7)
> (In reply to Taras Glek (:taras) from comment #6)
> > This seems like an excellent candidate for telemetry
> 
> What are you going to measure, exactly?  The amount of time we spend in
> madvise()?  On/off main thread?  That still doesn't tell the whole story,
> since off-thread madvise can block on-thread madvise.
> 
> Anyway, I don't see what we'd do with this data.

According to the screenshot, you just need to wrap the madvise call in timers.

What we would do with this data... well: "When we talked on IRC, you said you had only a 'hunch' that this was causing jank?"  We'll know it's not a hunch, and can either declare this "not a problem" or devise a new madvise strategy.
good luck calling back telemetry from jemalloc
(In reply to Mike Hommey [:glandium] from comment #12)
> good luck calling back telemetry from jemalloc

Doesn't jemalloc already keep stats?  Those could be polled at telemetry submit time.
> According to the screenshot you just need to wrap madvise call in timers.

Sure.  So we're talking two syscalls (gettimeofday) per madvise.  And I still don't understand what actionable information we might learn from this.

> Doesn't jemalloc already keep stats?  Those could be polled at telemetry submit time.

Sure.  We already track the number of madvise calls per arena.

I think telemetry is a distraction in this bug.  We've already identified a performance bottleneck here.  Taras, if you think telemetry wrapping madvise is a good idea, would you mind filing a separate bug?
Summary: The madvise call we make in arena_run_dalloc to decommit the freed pages can take between 20%-30% of our time → The madvise(MADV_FREE) call we make in arena_run_dalloc to can take between 20%-30% of our time
(In reply to Justin Lebar [:jlebar] from comment #5)
> > After my fix to bug 733277, this is now the #1 source of jankiness for me.
> 
> Whoa, wait a second.  We just went from "this is hot in my profiler (mostly
> off main thread)" to "this is the #1 source of jank for me".  When we talked
> on IRC, you said you had only a "hunch" that this was causing jank?

Sorry this was a miscommunication.  What I said about the hunch was about the assumption that madvise can hold VM locks.  This problem is real, I'm seeing it every day, and it is currently the #1 source of jank for me.  The behavior on other platforms might be less bad, of course.

About getting telemetry data here, I agree with Justin that the overhead will not be acceptable.  We free pages all the time, and adding telemetry probes there will only make things slower for sure.

And I don't know what we're going to do with the data we'd get from telemetry for this.  This involves a bunch of locking issues across multiple threads which is very hard to infer from a histogram, IMO.

We should just make free as cheap as possible.  Calling free fewer times is also a good goal but I think that should be a different bug.
(In reply to Justin Lebar [:jlebar] from comment #14)
> Taras, if you think telemetry wrapping madvise
> is a good idea, would you mind filing a separate bug?

Filed as bug 736184.
Whiteboard: [Snappy][telemetry-needed] → [Snappy]
> This problem is real, I'm seeing it every day, and it is currently the #1 source of jank for me.  
> The behavior on other platforms might be less bad, of course.

What exactly makes you think madvise is causing jank?  Couldn't it be that there are lots of very short madvise calls, which add up to a lot of main-thread time, but which are spread out evenly enough that they don't cause jank?

> We should just make free as cheap as possible.

Definitely.  You were doing a build with decommit, right?  Was that an improvement?
(In reply to Justin Lebar [:jlebar] from comment #17)
> > We should just make free as cheap as possible.
> 
> Definitely.  You were doing a build with decommit, right?  Was that an
> improvement?

You could also try seeing what happens if you remove the madvise.
> You could also try seeing what happens if you remove the madvise.

This would be one way to prove that the jank was actually caused by the madvise!  (Provided you don't run out of RAM and start paging...)
(In reply to Justin Lebar [:jlebar] from comment #19)
> > You could also try seeing what happens if you remove the madvise.
> 
> This would be one way to prove that the jank was actually caused by the
> madvise!  (Provided you don't run out of RAM and start paging...)

I doubt that would happen.
(In reply to Justin Lebar [:jlebar] from comment #17)
> > This problem is real, I'm seeing it every day, and it is currently the #1 source of jank for me.  
> > The behavior on other platforms might be less bad, of course.
> 
> What exactly makes you think madvise is causing jank?  Couldn't it be that
> there are lots of very short madvise calls, which add up to a lot of
> main-thread time, but which are spread out evenly enough that they don't
> cause jank?

Well, in Instruments you can filter based on a time range in your profile and only see the stuff which happened then.  With that, you can filter the main thread, go and select ranges where the CPU was being used 100%, and measure how long it took and what was happening.  I'll attach my profile shortly.  It has multiple second-long pauses doing madvise.

> > We should just make free as cheap as possible.
> 
> Definitely.  You were doing a build with decommit, right?  Was that an
> improvement?

I am running that build right now, but I need to run it at least for a day to provide useful results.
Attached file My profile
> It has multiple second-long pauses doing madvise.

Awesome.

> I am running that build right now, but I need to run it at least for a day to provide useful results.

I eagerly await!  One thing you could do in the meantime is push your decommit patch to try so we can get an idea of how it affects our benchmarks.
(In reply to Justin Lebar [:jlebar] from comment #23)
> > It has multiple second-long pauses doing madvise.
> 
> Awesome.

Hope you're being sarcastic.  :P

> > I am running that build right now, but I need to run it at least for a day to provide useful results.
> 
> I eagerly await!  One thing you could do in the meantime is push your
> decommit patch to try so we can get an idea of how it affects our benchmarks.

Of course: https://tbpl.mozilla.org/?tree=Try&rev=818f274cd7fa
> Hope you're being sarcastic.  :P

Actually, I meant that's awesome evidence of the claimed problem.
(In reply to Justin Lebar [:jlebar] from comment #25)
> > Hope you're being sarcastic.  :P
> 
> Actually, I meant that's awesome evidence of the claimed problem.

Oh, it is that!  :-)
Try run for 818f274cd7fa is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=818f274cd7fa
Results (out of 93 total builds):
    success: 73
    warnings: 20
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/eakhgari@mozilla.com-818f274cd7fa
Compare-talos: http://goo.gl/U1uBv

The largest change is a 6.7% Dromaeo DOM regression.  There is some green in there too, although mostly at much smaller percentages.   I didn't take the time to determine whether any of the other green or red is an actual change.

Note that this patch has no effect on 10.5 -- jemalloc is disabled there, bug 702250 -- so any differences on 10.5 are spurious.

The fact that this regressed Dromaeo DOM suggests to me it likely won't be an improvement in Ehsan's daily use.  But we'll see, I guess...
I've been running with my MALLOC_DECOMMIT build for a few days, and in general I don't see any improvements.  :(  I read a bit more of the kernel code, and the problem seems to be two-fold:

1. Doing either madvise(MADV_FREE) or mmap(PROT_NONE,MAP_FIXED|MAP_PRIVATE|MAP_ANON) is similarly expensive.  They both end up calling vm_map_lookup_entry which is potentially slow (i.e. it can end up doing a linear scan.)

2. We do a lot of this on the main thread.  The big offenders are:
* js::LifoAlloc::freeAll called by JSCompartment::sweep during GC.
* js::LifoAlloc::freeUnused called by BeginMarkPhase during GC.

There is also a huge offender on the GCHelperThread:
* js::gc::Arena::finalize<JSString> called by FinalizeArenas during GCHelperThread::doSweep.

These operations also grab the VM lock, but (strangely) I don't see anything lock-related in my profile.  (It's possible that the locks use kernel-level wait operations, which wouldn't appear in the profile the way spinlocks would, but I don't know enough about the OS X kernel to say whether it has that kind of lock.)

Why is it that we have to decommit these pages?  Is it to make sure we won't run out of physical memory and start paging?  Or is it to get the OS to show the real RSS number?  Can we avoid doing decommits in some cases at least?
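For reference, the two release strategies being compared above look roughly like this (a simplified sketch, not the actual jemalloc code paths; return values are ignored for brevity):

#include <sys/mman.h>

/* Strategy 1 (Mac default): mark the pages reclaimable; the mapping
 * stays in place and RSS only drops when the kernel needs the memory. */
static void release_with_madvise(void *addr, size_t len)
{
    madvise(addr, len, MADV_FREE);
}

/* Strategy 2 (MALLOC_DECOMMIT-style): replace the range with a fresh
 * inaccessible anonymous mapping, which actually drops the pages.
 * Both paths hit vm_map_lookup_entry in the kernel. */
static void release_with_decommit(void *addr, size_t len)
{
    mmap(addr, len, PROT_NONE, MAP_FIXED | MAP_PRIVATE | MAP_ANON, -1, 0);
}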
> Why is it that we have to decommit these pages?  Is it to make sure
> we won't run out of physical memory and start paging?  Or is it to get
> the OS to show the real RSS number?

AIUI, that's for the latter.  Also, AIUI, MADV_FREE doesn't happen on every free().
> AIUI that's for the latter.

Which is also why I suggested trying to remove the madvise in comment 18.
> Why is it that we have to decommit these pages?

The question should be, "Why do we have to give memory back to the operating system?"

> Or is it to get the OS to show the real RSS number?

That's jemalloc_purge_freed_pages(), which is run only when MALLOC_DOUBLE_PURGE is defined (and, as you discovered, MALLOC_DOUBLE_PURGE is incompatible with MALLOC_DECOMMIT).  So, no.

> Can we avoid doing decommits in some cases at least?

jemalloc already does this.  arena_dalloc calls arena_purge, which gives memory back to the operating system (via MADV_FREE or decommit), only if (arena->ndirty > opt_dirty_max).

Maybe opt_dirty_max should be increased.  But I'd want to understand better where the frees are coming from.
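In rough pseudocode, the throttling described above works like this (a sketch using the names from the comment, not the actual jemalloc source):

#include <stddef.h>

#define OPT_DIRTY_MAX 1024  /* a page count; see the value discussed below */

struct arena {
    size_t ndirty;  /* pages freed but not yet returned to the OS */
};

/* Hypothetical sketch of arena_purge: walk the dirty runs and
 * madvise(MADV_FREE)/decommit them, resetting the dirty count. */
static void arena_purge_sketch(struct arena *arena)
{
    arena->ndirty = 0;
}

static void arena_dalloc_sketch(struct arena *arena, size_t npages)
{
    arena->ndirty += npages;            /* freed pages become "dirty" */
    if (arena->ndirty > OPT_DIRTY_MAX)  /* purge only past the threshold */
        arena_purge_sketch(arena);
}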
(In reply to Mike Hommey [:glandium] from comment #30)
> > Why is it that we have to decommit these pages?  Is it to make sure
> > we won't run out of physical memory and start paging?  Or is it to get
> > the OS to show the real RSS number?
> 
> AIUI, that's for the latter.

In normal operation on Mac (when MALLOC_DECOMMIT is not defined, and MALLOC_DOUBLE_PURGE is defined), a true decommit (i.e. munmap) only happens when jemalloc_purge_freed_pages() is called.  In this profile, the hot function is arena_dalloc, which calls MADV_FREE in normal operation or munmap if MALLOC_DECOMMIT is defined.

Maybe this would be less confusing if we didn't talk about "decommit" but instead talked about madvise and munmap.

Again, the work we're doing so that the OS shows the correct RSS number is not what's showing up in this profile.  That work is done only under jemalloc_purge_freed_pages.  The work that's showing up in this profile is the madvise / munmap which we do to give memory back to the operating system.
> a true decommit (i.e. munmap) only happens when jemalloc_purge_freed_pages() is called.

And jemalloc_purge_freed_pages is only called when we read the process's RSS, which happens when you load about:memory or when Telemetry runs.
(In reply to Justin Lebar [:jlebar] from comment #32)
> > Why is it that we have to decommit these pages?
> 
> The question should be, "Why do we have to give memory back to the operating
> system?"

I'm a bit confused now.  We unmap the pages which we madvise at some point, right?  I guess my question is, why can't we avoid the madvise call and just rely on munmap to free the pages?

> > Can we avoid doing decommits in some cases at least?
> 
> jemalloc already does this.  arena_dalloc calls arena_purge, which gives
> memory back to the operating system (via MADV_FREE or decommit), only if
> (arena->ndirty > opt_dirty_max).

IINM, currently opt_dirty_max is 1024.  So given the fact that once this problem happens, it keeps on happening over and over again, could the real problem be freeing pages in a way which causes very few dirty pages to be adjacent, so we end up calling madvise a crazy number of times or something?

> Maybe opt_dirty_max should be increased.  But I'd want to understand better
> where the free's are coming from.

Does comment 29 contain enough information?
> I'm a bit confused now.  We unmap the pages which we madvise at some point, right?

We deallocate arena memory [1] in a different way on each of Linux, Windows, and Mac.

* Linux - We call madvise(MADV_DONTNEED) (mostly equivalent to mac's MADV_FREE) to free arena memory.  We never munmap arena memory.

* Windows - We munmap to free arena memory.  The equivalent to MADV_DONTNEED/MADV_FREE would be VirtualAlloc(MEM_RESET) [2].  jasone tells me over e-mail that he's not sure why MEM_RESET wasn't used in our Windows port, but perhaps because it interferes with measuring RSS in the same way that MADV_FREE on Mac interferes with RSS measurements.

* Mac - We call madvise(MADV_FREE) to free arena memory.  The munmap call on Mac is only for RSS measurement purposes, and munmap is called on arena memory only when we read RSS (telemetry, about:memory).  Aside from the measurement issue, we could get by just fine never calling munmap on Mac.  This Mac munmap is conceivably expensive, but I haven't seen it show up in any profile of yours.

So to answer your question, no, we do not necessarily unmap the pages which we madvise on Mac.

> I guess my question is, why can't we avoid the madvise call and just rely on munmap to free the pages?

We could.  That's what you did when you built with MALLOC_DECOMMIT defined.  But you're not avoiding an madvise -- you're replacing an madvise with an munmap.  It's likely that you didn't see a performance benefit because:

 * munmap seems to be roughly as expensive as madvise, and
 * MALLOC_DECOMMIT saves us an extra syscall only when we read RSS in telemetry/about:memory -- with DECOMMIT, reading RSS doesn't trigger any munmaps, but without DECOMMIT, we have to munmap.  But that code does not appear to be hot in your profiles.

> IINM, currently opt_dirty_max is 1024.  So given the fact that once this problem happens, it keeps on happening 
> over and over again, could the real problem be freeing pages in a way which causes very few dirty pages to be 
> adjacent, so we end up calling madvise a crazy number of times or something?

That sounds plausible to me.

> These operations also grab the VM lock, but (strangely) I don't see anything lock related in my profile

Does Shark/Instruments have a lock contention mode, or did you check that too?  Spinlocks would show up in the CPU profiler mode, but you have a chance of seeing kernel wait locks in lock contention mode.

If this still isn't clear, ping me on IRC tomorrow and we can chat over a higher-bandwidth channel.

[1] Arena memory is where allocations smaller than 1MB live.  So-called "huge" allocations, bigger than 1MB, are always munmap'ed, on all platforms.  These allocations are not showing up in the attached profiles.

[2] http://msdn.microsoft.com/en-us/library/windows/desktop/aa366887%28v=vs.85%29.aspx
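A compact sketch of the three per-platform release paths described above (simplified and hypothetical; in particular, the Windows branch assumes VirtualFree(MEM_DECOMMIT) as the decommit mechanism, which is what "munmap" loosely refers to there):

#include <stddef.h>
#ifdef _WIN32
#  include <windows.h>
#else
#  include <sys/mman.h>
#endif

static void release_arena_pages(void *addr, size_t len)
{
#if defined(__linux__)
    /* Linux: drop the pages eagerly; the mapping itself is kept. */
    madvise(addr, len, MADV_DONTNEED);
#elif defined(_WIN32)
    /* Windows: decommit rather than MEM_RESET (see jasone's note above). */
    VirtualFree(addr, len, MEM_DECOMMIT);
#elif defined(__APPLE__)
    /* Mac: only mark the pages reclaimable; RSS stays high until the
     * kernel reclaims them or jemalloc_purge_freed_pages() runs. */
    madvise(addr, len, MADV_FREE);
#endif
}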
So the three concrete suggestions we have for moving forward here are:

 1) Comment out the madvise call.  (This will cause us never to free arena memory.)  Observe whether the build with this change is janky.  If not, this would confirm that madvise is the root cause of the jank.

 2) Increase opt_dirty_max and see if the build doesn't become janky after the same amount of use.  Or alternatively, decrease opt_dirty_max and see if the build becomes janky more quickly.

 3) We suspect that possibly the madvise calls when we're not janking cover large swaths of memory, while the madvise calls when we are janking cover smaller swaths.  You could report this by, e.g. keeping a histogram of how many pages we touched with one madvise() and occasionally dumping the histogram to the console.
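Suggestion 3 could be prototyped with something as small as the following (a hypothetical sketch; the power-of-two buckets and names are made up):

#include <stdio.h>
#include <sys/mman.h>

#define NBUCKETS 8
static unsigned long madvise_pages_hist[NBUCKETS];  /* bucket i ~ 2^i pages */

static void counted_madvise(void *addr, size_t npages, size_t pagesize)
{
    size_t bucket = 0, n = npages;
    while (n > 1 && bucket < NBUCKETS - 1) { n >>= 1; bucket++; }
    madvise_pages_hist[bucket]++;
    madvise(addr, npages * pagesize, MADV_FREE);
}

/* Call occasionally (e.g. every few thousand madvises) to dump counts. */
static void dump_madvise_hist(void)
{
    for (size_t i = 0; i < NBUCKETS; i++)
        fprintf(stderr, "madvise of ~%zu page(s): %lu calls\n",
                (size_t)1 << i, madvise_pages_hist[i]);
}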
Apparently it is possible to get some lock contention data out of Instruments using its System Trace view (although the data has to be somewhat inferred from the profile).  I'll do this the next time my Firefox starts to jank.

(In reply to Justin Lebar [:jlebar] from comment #37)
> So the three concrete suggestions we have for moving forward here are:
> 
>  1) Comment out the madvise call.  (This will cause us never to free arena
> memory.)  Observe whether the build with this change is janky.  If not, this
> would confirm that madvise is the root cause of the jank.

Measuring the alternate theory (lock contention) should take care of this step.

>  2) Increase opt_dirty_max and see if the build doesn't become janky after
> the same amount of use.  Or alternatively, decrease opt_dirty_max and see if
> the build becomes janky more quickly.

OK, I'll do this next.  Do you have any idea why the current value was chosen?

>  3) We suspect that possibly the madvise calls when we're not janking cover
> large swaths of memory, while the madvise calls when we are janking cover
> smaller swaths.  You could report this by, e.g. keeping a histogram of how
> many pages we touched with one madvise() and occasionally dumping the
> histogram to the console.

This sounds like a good idea, I'll do this as well.
> OK, I'll do this next.  Do you have any idea why the current value was chosen?

I think it's arbitrary.  IIRC it's a page count, so it corresponds to jemalloc keeping around at most 4MB of memory that it could otherwise give back to the operating system.

There's obviously a trade-off between keeping around extra memory and giving it back to the OS, and maybe 4MB is too low.  But it's not clear why this only becomes a problem after you've been running the browser for days.
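For reference, the arithmetic behind the 4MB figure, assuming 4 KiB pages:

/* opt_dirty_max is a page count, so the cap on freed-but-unreturned
 * ("dirty") memory per arena is roughly:
 *   1024 pages x 4096 bytes/page = 4,194,304 bytes = 4 MiB */
enum { DIRTY_CAP_BYTES = 1024 * 4096 };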
Perhaps because the memory usage keeps increasing?  (Also, fragmentation might become worse as time passes.)
> Perhaps because the memory usage keeps increasing?  (Also, fragmentation might become worse as time 
> passes.)

It could be fragmentation requiring us to make more madvise calls ((3) in comment 37).

Higher memory usage shouldn't affect this value -- it's 4MB of unused memory we keep around, regardless of how much memory we're using.  But of course if we churn (malloc and free) more memory, that would matter.
I've been meaning to test this, but every time I end up either forgetting to launch my patched build, or needing to restart Firefox for some reason before I can reproduce this.

I was curious, have other people also tried to profile a large Firefox session after a day or so?  I usually have three tabs of gmail, two tabs of TBPL and one tab of Facebook open.
P2 because this is a Mac-only issue
Whiteboard: [Snappy] → [Snappy:P2]
Severity: normal → S3