Closed Bug 842800 Opened 11 years ago Closed 3 years ago

Track peak memory usage

Categories

(Core :: General, defect)


Tracking


RESOLVED WONTFIX

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

(Whiteboard: [MemShrink:P2])

Attachments

(1 file, 1 obsolete file)

about:memory is great, but it gives us zero insight into peak memory consumption.  Measuring peak consumption isn't easy to do in a fine-grained fashion, but it should be possible to get it for the C heap and JS heap without too much trouble.

That should be enough to give us a sense of whether peak memory consumption is significantly higher than the steady-state memory consumption reported by about:memory.
Another idea is to periodically measure RSS, to get its peak.
Whiteboard: [MemShrink] → [MemShrink:P2]
Assignee: nobody → n.nethercote
Something we probably want to do in this bug or in a follow-up is to give users a way to reset the high water mark.  That way you can more easily debug things like "what's the high water mark when I do X?".
> Something we probably want to do in this bug or in a follow-up is to give
> users a way to reset the high water mark.

The simplest implementation of that would be to reset it every time it's requested.  Might be too inflexible, though.  We'd have to see.
I'm going to change the direction of this bug slightly and make it a bit more ambitious.

The hard part about measuring peak memory consumption is knowing when you're at
the peak.  My goal here is to have a "peak-finding" mode, probably triggered in
about:memory, which runs slightly slower than normal.  Ideally it would work in
vanilla builds.

Here's the basic approach.

- Have (a) a cheap way to measure "total" memory consumption, and (b) an
  expensive, detailed measurement of memory.

- Do the cheap total measurement frequently.
  - How frequently determines likelihood of missing a peak, and affects the
    slow-down.

- When the cheap total measurement hits a new peak, if the peak is X% higher
  than the previous peak total, do an expensive measurement.
  - The value of X determines how close we get to the peak, and affects the
    slow-down.

For our implementation, the obvious candidate for the expensive, detailed
measurement is a memory report dump to $TMPDIR.  And X% can be user-selected
via about:memory;  5% might be a reasonable default.
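
To make that concrete, here's a minimal standalone sketch of the loop I have in mind.  (MeasureTotalCheap() and DumpDetailedReports() are hypothetical placeholders for the cheap measurement and the expensive dump; the 200 ms interval and the 5% default are just illustrative.)

#include <chrono>
#include <cstddef>
#include <thread>

size_t MeasureTotalCheap();   // hypothetical: e.g. RSS or heap-allocated
void DumpDetailedReports();   // hypothetical: dump memory reports to $TMPDIR

void FindPeaks(double threshold /* e.g. 0.05 for "5% above the last dump" */)
{
    size_t lastDumpTotal = 0;
    for (;;) {
        size_t total = MeasureTotalCheap();
        // Only pay for the expensive dump when the cheap total is at least
        // |threshold| higher than it was at the last dump.
        if (total > lastDumpTotal * (1.0 + threshold)) {
            DumpDetailedReports();
            lastDumpTotal = total;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
    }
}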

So that's easy.  The hard part is choosing the cheap measurement, and its
frequency of measurement.  Some possibilities for the measurement...

- |nsMemoryReporterManager.explicit|.  It's moderately expensive.
  We certainly couldn't see every change to it.

- jemalloc's "heap-allocated" measurement.  The current implementation is
  moderately expensive because it measures lots of other things at the same
  time, but we could hack mozjemalloc to improve that.  We might be able to see
  every change to it, but that would ignore non-heap allocations.
  
- RSS is the most important measurement for mobile OOMs(?).  So measure it via
  |nsMemoryReporterManager.resident|.  

  - On Linux/Android this opens and reads /proc/self/statm (a single-line
    file).
  - On Windows it calls GetProcessMemoryInfo().
  - On Mac it calls GetTaskBasicInfo(), and if you want it to be accurate, also
    purges jemalloc's freed pages... ugh.
  
  Again, we couldn't see every change to it.

None of these are all that cheap.  (E.g. the cheapest possible measurement
would be simply reading a global counter.)  So we couldn't check them very
frequently.
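
(For concreteness, here's roughly what the Linux/Android RSS read comes down to: the second field of /proc/self/statm is the resident set size in pages.  This is a standalone sketch, not the actual nsMemoryReporterManager code.)

#include <cstdio>
#include <unistd.h>

long ResidentBytes()
{
    // /proc/self/statm is a single line: size resident shared text lib data dt,
    // all in pages.  We want the second field.
    FILE* f = fopen("/proc/self/statm", "r");
    if (!f) {
        return -1;
    }
    long totalPages = 0, residentPages = 0;
    int n = fscanf(f, "%ld %ld", &totalPages, &residentPages);
    fclose(f);
    if (n != 2) {
        return -1;
    }
    return residentPages * sysconf(_SC_PAGESIZE);
}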

Some possibilities for the peak checking frequency...

- On a timer -- e.g. every second, half-second, quarter-second, whatever.
  (User-controllable.)

- Every Nth event in the event loop.  (Also user-controllable.)

- In response to "check-peak" events, which could be triggered by our
  allocators (jemalloc, JS heap) every time a large chunk is allocated?

- Some combination of the above.

It's unclear how short-lived our peaks are, and therefore how frequently we
need to measure to catch them.  Hopefully not too short-lived.
Summary: Track peak memory usage of the C and JS heaps → Track peak memory usage
To start with, I'll try measuring RSS on Linux a few times a second, and see how that works.
A few comments/questions:
- I'm a bit wary of time-based peak sampling; you can allocate a ton of memory in a short period of time (and in these asm.js demos we do).  What about hooking mmap?  It seems like mmap calls would be rather less frequent than malloc calls.
- In addition to dumping about:memory to $TMPDIR, could we also get the stack trace of the malloc/mmap call that hit the peak?  The two together would, I think, be a pretty useful combo.
- It'd be great if we could see more than just the max peak.  E.g., right now I'm pretty sure the peak for the UE3 demos is the parse nodes, but there are 3 or more shorter peaks after that that I'd like to understand as well.  Perhaps a large free() call could reset the peak measurement?
What we did on Windows with AvailableMemoryTracker may be instructive.

There, we wrapped VirtualAlloc (the equivalent of mmap), and every time a large VirtualAlloc came through, we'd read rss, which is fast compared to running VirtualAlloc.

On Windows, you have to explicitly commit your memory when you call VirtualAlloc (although I don't know if memory committed this way immediately shows up in RSS).  This is different from *nix, where if you mmap 100 MB, the OS doesn't reserve any space in memory or the swap file until you touch a page.  The fact that we knew how much memory was being committed when VirtualAlloc was called made this scheme somewhat reasonable.

I can imagine doing something similar on *nix.  We'd wrap mmap and then maybe set up write barriers on the resultant pages, so we know when they are committed?

But this seems like a lot of work, when it's not clear to me that simply dumping about:memory to JSON once a second is unworkable.  Has that been tried?
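
For what it's worth, the VirtualAlloc side of that idea is roughly the following.  (A standalone sketch, not the actual AvailableMemoryTracker code; TrackedVirtualAlloc and MaybeCheckPeak are made-up names, and the 1 MiB threshold is arbitrary.)

#include <windows.h>
#include <psapi.h>

static const SIZE_T kLargeAllocThreshold = 1024 * 1024;  // 1 MiB, arbitrary

static void MaybeCheckPeak()
{
    PROCESS_MEMORY_COUNTERS pmc;
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc))) {
        // Compare pmc.WorkingSetSize (roughly RSS) against the previous peak
        // here, and trigger a detailed dump if it has grown enough.
    }
}

void* TrackedVirtualAlloc(SIZE_T size, DWORD allocType, DWORD protect)
{
    void* p = VirtualAlloc(nullptr, size, allocType, protect);
    // Committing is explicit on Windows, so a large commit is a good moment to
    // sample memory; reading the counters is cheap relative to the allocation.
    if (p && (allocType & MEM_COMMIT) && size >= kLargeAllocThreshold) {
        MaybeCheckPeak();
    }
    return p;
}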
> - It'd be great if we could see more than just the max peak.

I think this is a pretty important case to consider.  As soon as you see the peak and establish either that you can't fix it or that it's being fixed, you'll want to see the other peaks.
> - In addition to dumping about:memory to $TMPDIR, could we also get the
> stack trace of the malloc/mmap call that hit the peak?  The two together
> would, I think, be a pretty useful combo.

Should be pretty easy.  BTW, I think you'll like the memory reporting that Benoit is putting into SPS right now :)
I have a draft patch working, but there's a problem.

Here's how it works.

- about:memory has "start peak-finding mode" and "stop peak-finding mode"
  buttons.

- When you start peak-finding mode, a new thread is spawned.  It loops,
  reading RSS five times a second.  If the RSS is 10% higher than it was at
  the last dump, it dumps reports to $TMPDIR.

- When you stop peak-finding mode, the thread is canceled.

Pretty simple.  But in the "dump reports" step it ends up in
nsMemoryReporterManager::EnumerateReporters(), which aborts because it isn't
being called on the main thread:

NS_IMETHODIMP
nsMemoryReporterManager::EnumerateReporters(nsISimpleEnumerator **result)
{   
    // Memory reporters are not necessarily threadsafe, so EnumerateReporters()
    // must be called from the main thread.
    if (!NS_IsMainThread()) {
        MOZ_CRASH();
    }   

So I thought "aha, I'll just spawn a runnable to do the dumping, and dispatch
it to the main thread!"  That works, in that it avoids the crash.  But the
runnable can be significantly delayed.  For example, I ran the Unreal demo and
it spawned 15 dump runnables, but none of them actually dumped until the
memory-intensive start-up sequence had finished -- i.e. I missed all the
interesting bits, and ended up with lots of dumps that occurred all in quick
succession and so gave almost identical measurements.

I guess I need to change things so that reporters are thread-safe...?
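
For reference, the runnable version looked roughly like this, assuming the nsThreadUtils helpers of that era; DumpMemoryReports() is a hypothetical stand-in for the actual dumping code.

#include "nsThreadUtils.h"

void DumpMemoryReports();   // hypothetical stand-in for the real dump code

class DumpReportsRunnable : public nsRunnable
{
public:
    NS_IMETHOD Run()
    {
        // Runs on the main thread, but only when the event loop next spins --
        // which is exactly why the dumps arrived too late in practice.
        DumpMemoryReports();
        return NS_OK;
    }
};

// Called from the peak-finding thread when a new peak is seen:
void RequestDump()
{
    nsRefPtr<DumpReportsRunnable> r = new DumpReportsRunnable();
    NS_DispatchToMainThread(r);
}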
Hmm.  I reverted to the "dump from the peak-finding thread" version and disabled the assertion in EnumerateReporters() to see what would happen.  I immediately hit a similar assertion:

nsXPConnect*
nsXPConnect::GetXPConnect()
{
    // Do a release-mode assert that we're not doing anything significant in
    // XPConnect off the main thread. If you're an extension developer hitting
    // this, you need to change your code. See bug 716167.
    if (!MOZ_LIKELY(NS_IsMainThread() || NS_IsCycleCollectorThread()))
        MOZ_CRASH();

This function is called by the JS reporters to get ahold of the main JS runtime.  Hmm.

So I think I need a way to say "run this function on the main thread, right now, interrupting whatever it's doing".  I'm not optimistic we have such a thing.
Or, a way to periodically (and frequently) run the peak-finding function on the main thread itself would suffice.  I also can't see how that would be achieved.  ("Hey, let's interrupt the JS parser in the middle of parsing this huge script to run some arbitrary Gecko function...")
(In reply to Nicholas Nethercote [:njn] from comment #11)
> So I think I need a way to say "run this function on the main thread, right
> now, interrupting whatever it's doing".  I'm not optimistic we have such a
> thing.

You can get this on mac/linux with signaling but not windows AFAIK. You want to tgkill with the main thread's TID. The registered signal handler will run on the main thread. However, everything you call must be signal-safe, as you know, so calling Gecko code will again immediately blow up.
> So I think I need a way to say "run this function on the main thread, right now, interrupting 
> whatever it's doing".  I'm not optimistic we have such a thing.

You can try the mechanism used by nsThread::ScheduleMemoryPressureEvent.  It's pretty simple.  You're still going to wait for the next turn of the event loop, though.
> You can get this on mac/linux with signaling but not windows AFAIK. You want
> to tgkill with the main thread's TID. The registered signal handler will run
> on the main thread. However, everything you call must be signal-safe, as you
> know, so calling Gecko code will again immediately blow up.

So... no hope there.

> You can try the mechanism used by nsThread::ScheduleMemoryPressureEvent. 
> It's pretty simple.  You're still going to wait for the next turn of the
> event loop, though.

I tried that, it's still not immediate enough for the Unreal demo -- 13 dump requests were made before we got back to the event loop.

I'm out of ideas :(
Hmm... the JS engine has its "operation callback", which it checks and possibly runs during long-running operations.  I might be able to use that.  It currently doesn't get checked during parsing, but it should be possible to add a check there.

Assuming that can be dealt with, it's unclear how many other places in the browser have potentially long-running, memory-intensive operations that take place between iterations of the main event loop.  We'll just have to suck it and see.
The operation callback does the trick!  Here's a rough draft patch that uses
it.  Things to note:

- Every time we hit a peak, the patch (a) triggers a JS engine operation
  callback, and (b) schedules an event on the main Gecko thread.  Whichever is
  hit first does the dump, and nullifies the other one.

- I added some operation callback checks (JS_CHECK_OPERATION_LIMIT) to the
  parser.  This is enough that all but two of the dump requests get handled
  immediately (before it was 13).

  I'm pretty sure the remaining two are due to Ion compilation, because in the
  last dump before the skipped two there is no asm.js code, and in the first
  one afterwards there is 84 MiB of it.  So I need to find the long-running
  loop(s) in Ion and add operation callback checks to them.

- I ended up putting most of the code (including
  nsMemoryReporterManager::FindPeaks()) in nsJSEnvironment.cpp, because that
  file already imports tons of stuff;  plus it has a pointer to the main
  JSRuntime just hanging around, which I needed to trigger the operation
  callback.

- It's only been tested on Linux, and uses various non-portable operations like
  pthread_create() and usleep().  It also spams stderr with various messages.
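
(Aside, for anyone reading along: the "whichever is hit first does the dump, and nullifies the other one" arrangement is just a compare-and-swap on a flag.  A minimal sketch with made-up names, DumpMemoryReports() again standing in for the real dump code:)

#include <atomic>

void DumpMemoryReports();   // hypothetical stand-in for the real dump code

static std::atomic<bool> gDumpPending(false);

// Called by the peak-finding thread when a new peak is seen.
void RequestDump()
{
    gDumpPending.store(true);
}

// Called from both the JS operation callback and the dispatched runnable.
void TryDump()
{
    bool expected = true;
    // Atomically claim the pending request; the loser sees false and does nothing.
    if (gDumpPending.compare_exchange_strong(expected, false)) {
        DumpMemoryReports();
    }
}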
Here's an excerpt of the measured peak running the Citadel demo.  Note the 1304 MiB of "temporary" (i.e. parse nodes) memory.  The actual peak was about 100--150 MiB higher;  I suspect that this measurement misses some memory that IonMonkey transiently allocated at the true peak.

1,711.08 MB (100.0%) -- explicit
├──1,416.63 MB (82.79%) -- js-non-window
│  ├──1,394.68 MB (81.51%) -- runtime
│  │  ├──1,304.00 MB (76.21%) ── temporary
│  │  ├─────84.91 MB (04.96%) -- code
│  │  │     ├──84.54 MB (04.94%) ── asm.js
│  │  │     └───0.38 MB (00.02%) ++ (6 tiny)
│  │  └──────5.77 MB (00.34%) ++ (10 tiny)
│  ├─────19.03 MB (01.11%) ++ zones
│  └──────2.92 MB (00.17%) ++ gc-heap
├────233.16 MB (13.63%) ── heap-unclassified
├─────32.11 MB (01.88%) -- window-objects
│     ├──18.89 MB (01.10%) -- top(file:///home/njn/Desktop/asm-js/EpicCitadel/index.html, id=15)
│     │  ├──18.69 MB (01.09%) -- active
│     │  │  ├──17.84 MB (01.04%) -- window(file:///home/njn/Desktop/asm-js/EpicCitadel/index.html)
│     │  │  │  ├──17.59 MB (01.03%) ++ js-compartment(file:///home/njn/Desktop/asm-js/EpicCitadel/index.html)
│     │  │  │  └───0.25 MB (00.01%) ++ (4 tiny)
│     │  │  └───0.85 MB (00.05%) ++ (2 tiny)
│     │  └───0.20 MB (00.01%) ++ js-zone(7fbdb8bf8800)
│     └──13.21 MB (00.77%) ++ (4 tiny)
└─────29.18 MB (01.71%) ++ (17 tiny)
Wild and presumptuous idea: in this about:memory "peak finding mode", wouldn't it be cool if about:memory displayed a graph with time on the x axis and total explicit usage on the y axis, with nodes you could click on to display the dumps?  Even cooler if, say, window.dump/console.log messages showed up on this timeline, so that the user could instrument their code to see when things were happening relative to each other.
> - It'd be great if we could see more than just the max peak.  E.g., right
> now I'm pretty sure the peak for the UE3 demos is the parse nodes, but there
> are 3 or more shorter peaks after that that I'd like to understand as well. 
> Perhaps a large free() call could reset the peak measurement?

Measuring RSS every 200 ms I see two main peaks:  first, it rises steadily to the parsing/compilation peak of about 2.1 GB, and then it briefly drops back to 530 MB before climbing to 1.3 GB (albeit with a small, temporary dip from 800 MB to 780 MB) while the demo is running.

I think having a "single peak-finding mode" and a "multiple peak-finding mode" might be useful.  Sometimes you just want one peak, and you don't want to end up with lots of dumps in a session where memory consumption goes up and down a lot.  But sometimes you really do want to know about multiple peaks, as you say.  In multiple mode I could reset the peak if RSS drops by half, perhaps?  Getting the multi-peak heuristic right could be tricky.
Depends on: 856917
> Getting the multi-peak heuristic right could be tricky.

Perhaps we could do something like measure once every Y * {how long measuring takes} (perhaps Y=10) and also measure when RSS increases or decreases by at least Z% over the last time we measured?
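
A sketch of one way that heuristic could look.  (Purely illustrative: the thresholds aren't tuned and the class/field names are made up.)

#include <cstddef>

class MultiPeakDetector
{
public:
    explicit MultiPeakDetector(double growThreshold = 0.05,   // 5% above last dump
                               double resetFraction = 0.5)    // re-arm after 50% drop
        : mGrowThreshold(growThreshold), mResetFraction(resetFraction),
          mLastDumpRss(0), mPeakRss(0) {}

    // Returns true if the caller should do an expensive report dump now.
    bool Sample(size_t rss)
    {
        if (rss > mPeakRss) {
            mPeakRss = rss;
        }
        // Re-arm: once RSS has fallen to half of the recorded peak, treat the
        // next rise as a fresh peak rather than requiring a new all-time high.
        if (rss < mPeakRss * mResetFraction) {
            mPeakRss = rss;
            mLastDumpRss = rss;
        }
        if (rss > mLastDumpRss * (1.0 + mGrowThreshold)) {
            mLastDumpRss = rss;
            return true;
        }
        return false;
    }

private:
    double mGrowThreshold;
    double mResetFraction;
    size_t mLastDumpRss;
    size_t mPeakRss;
};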
I'm putting this on the backburner.  Here's the most recent patch, which
applies cleanly on top of 134901:22c88d3d898a.
Attachment #737337 - Attachment is obsolete: true
bkelly, I wonder if it would be useful to have the ability to trigger a memory report dump the first time RSS exceeds a certain value?
Flags: needinfo?(bkelly)
(In reply to Nicholas Nethercote [:njn] from comment #23)
> bkelly, I wonder if it would be useful to have the ability to trigger a
> memory report dump the first time RSS exceeds a certain value?

That would absolutely be valuable for situations where the process is getting OOM-killed on a device.  We could set the limit just under where the OOM occurs and then dump the memory report before it's killed.  Ideally it would be the combination of RSS+swap, though, as leaking processes on tarako tend to just inflate swap.

The two obstacles that come to mind are:

1) Where/when do we check for the increase?  Ideally we want to catch the increase as soon as possible, but increases may not all come via malloc, etc.
2) How do we get the report across ADB to the user?  Stick them on disk somewhere and then wait for the user to pull them later with ADB?

If you're asking me if I have time to work on this, however, I don't think I will for the foreseeable future.
Flags: needinfo?(bkelly)
> 1) Where/when do we check for the increase?  Ideally we want to catch the
> increase as soon as possible, but increases may not all come via malloc, etc.

Previous comments have some discussion of this.

> 2) How do we get the report across ADB to the user?  Stick them on disk
> somewhere and then wait for the user to pull them later with ADB?

Something something get_about_memory.py mumble mumble.

> If you're asking me if I have time to work on this, however, I don't think I
> will for the foreseeable future.

Nope, wasn't asking that. One of the difficulties of this bug is getting the peak-finding heuristics right. If we can just get the user to set a threshold and do one dump when we hit that, it makes things quite a bit simpler.
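
The threshold version really would be simple; something like this (illustrative only, with DumpMemoryReports() standing in for the real dump code):

#include <cstddef>

void DumpMemoryReports();   // hypothetical stand-in for the real dump code

// Dump once, the first time RSS crosses a user-chosen limit.
void CheckThreshold(size_t rss, size_t limitBytes)
{
    static bool dumped = false;
    if (!dumped && rss >= limitBytes) {
        dumped = true;
        DumpMemoryReports();
    }
}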
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX