Bug 634642 Opened 13 years ago Closed 13 years ago

Slow gradual memory increase in Fennec

Categories

(Firefox for Android Graveyard :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(fennec-)

RESOLVED FIXED
Tracking Status
fennec - ---

People

(Reporter: mbrubeck, Assigned: azakai)

References

Details

(Keywords: memory-footprint, mobile, regression)

Attachments

(8 files)

Memory measurements are based on the "Pss" total from "dumpsys meminfo org.mozilla.fennec" running on a Galaxy Tab with Android 2.2.

On startup, Fennec trunk Android builds use around 45MB of RAM.

As I navigate in a single tab (without opening any new tabs), memory use increases by about 1MB every time I navigate to a new page.
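For anyone scripting the measurement described above, a minimal sketch in Python (assuming adb is on the PATH; the exact dumpsys output layout varies between Android versions, so the parsing here is an approximation that takes the last number on the first line mentioning Pss):

import re
import subprocess

def pss_total_kb(package="org.mozilla.fennec"):
    # Run "adb shell dumpsys meminfo <package>" and pull out a Pss total.
    # The output layout differs across Android versions; this sketch assumes
    # the Pss summary row has the total as its last numeric column.
    out = subprocess.run(
        ["adb", "shell", "dumpsys", "meminfo", package],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if "Pss" in line:
            numbers = re.findall(r"\d+", line)
            if numbers:
                return int(numbers[-1])
    raise ValueError("no Pss line found in dumpsys output")

if __name__ == "__main__":
    print("Pss total: %d KB" % pss_total_kb())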
tracking-fennec: --- → ?
In comparison to the Galaxy Tab: on the Nexus One I start off at 47MB, and after navigating to a single page in the single open tab, memory usage shot up to 56MB (based on the Pss total).
Are people seeing this only on Android? Not on desktop?
I can reproduce this on desktop Linux too.  It looks like the RSS for the chrome process grows continually without bound, while the content process remains mostly flat long term (going up and down depending on which page is loaded).
On Linux desktop, I see the fennec process start at 54 MB. Then, with a combination of reloads and clicking on links, it rises to 85 MB.

However, it then drops to 74 MB eventually. And the process repeats.

So it doesn't increase without limit in my testing.
(In reply to comment #4)
> On Linux desktop, I see the fennec process start at 54 MB. Then, with a
> combination of reloads and clicking on links, it rises to 85 MB.
> 
> However, it then drops to 74 MB eventually. And the process repeats.
> 
> So it doesn't increase without limit in my testing.

Hmm, maybe I just didn't test long enough - I stopped when the memory was around 82 MB.

In a Linux desktop trunk nightly from 2011-01-16, I saw RSS for the chrome process grow to about 67 MB, then drop back to about 60 MB as I continued browsing.

Now I'm testing a Linux desktop trunk nightly from 2011-02-16. RSS for chrome reached 99 MB without stopping.  Then I had to leave for a bit, and while the browser was idle, it dropped back down to 76 MB.
(In reply to comment #5)
> Now I'm testing a Linux desktop trunk nightly from 2011-02-16. RSS for chrome
> reached 99 MB without stopping.  Then I had to leave for a bit, and while the
> browser was idle, it dropped back down to 76 MB.

In another test with the 2011-02-16 nightly, the memory reached 100 MB and then quickly dropped back down to 79 MB while I was still loading pages.  So it looks like the overall pattern has not changed, but the baseline or threshold has.
OS: Android → All
Hardware: ARM → All
Summary: Fennec memory usage increases over time and never goes back down → Fennec chrome process memory can grow up to ~100 MB while loading pages.
Attached file massif data
Massif data, running Fennec for 15 minutes, 25 page loads, 25 page reloads (google search pages). Nothing seems immediately odd, but maybe I missed something.
30 minutes of Fennec loading pages t1 and t2, each of which waits 2 seconds and then loads the other (2 seconds, since that is how long it takes to load a page in valgrind). So, 900 page loads. Memory doesn't appear to increase uncontrollably. It might be that more complicated pages would behave differently, though.

The pages used:

<html>
<body>
<script>
setTimeout(function() {
  window.location = 't2.html';
}, 2000);
</script>
Page 1!
</body>
</html>

etc.
Assignee: nobody → azakai
I had hoped to get some more profiling data, but massif crashes when trying to load most websites. In any case, I think the data above should be enough, and I'm out of ideas for what else to test.

That our memory use rises by 30MB or so before stabilizing is not great; it's probably a combination of fragmentation and arena allocations, but it isn't too bad.
Assignee: azakai → nobody
Ok, got massif to work again (I forgot --smc-check before).

Here is 45 minutes of automatically loading google and yahoo, alternating, in an iframe. There might be a slight gradual increase, but it is very hard to say.
Child process for previous comment.

Seems stable (perhaps a tiny overall increase... not sure), except for a huge spike of 8MB or so (snapshot 25). Looks like that is due to

  26.04% (7,340,032B) 0x5BC0B7D: gfxImageSurface::gfxImageSurface(gfxIntSize const&, gfxASurface::gfxImageFormat)

which doesn't appear in any other snapshot. I'm not sure why it is absent from all the others, or why it appears at such a large size in this one case. wesj had a theory that perhaps one of the loads of yahoo.com loaded a different image (they do change between loads sometimes) that was exceptional in some way.
Isn't jemalloc supposed to handle fragmentation? Or is the fragmentation outside of jemalloc? I could see a win here by properly managing the memory we allocate for layers (assuming it is allocated outside of jemalloc).
jemalloc allocates memory in a way that reduces fragmentation compared to other allocators. All of our memory is (or at least very much should be) allocated through jemalloc, including these layers.
jemalloc doesn't allocate buffers for layers. That's all ashmem.
Attached file 7-hour massif: parent
7 hours of massif, loading pages A and B in alternation, each with an iframe containing mozilla's mobile site or twitter's mobile site.

7 hours, with about 7 seconds to load a page (as this is in valgrind), gives about 3,600 page loads, 100% CPU all the while.

This file is the parent process.
Attached file 7-hour massif: child
Child data.
Looks like there is a slow, gradual increase. Perhaps 2MB over the whole 7 hours.

I will work on the logs to figure out what is changing here.
Assignee: nobody → azakai
tracking-fennec: ? → 2.0-
If we are only growing 2MB over 7 hours, I think we shouldn't block on this.
Attached file test page
Test page used in the data I will post next.
Attached file massdiff data
I wrote a script to diff massif snapshots, http://hg.mozilla.org/users/azakai_mozilla.com/patches/file/tip/massdiff.py

Ran the script on massif data from the test page in the previous comment, over 1 hour, which included 528 pageloads (4147 seconds, or 7.85 seconds per pageload).

The massdiff output here is of snapshots 33 and 66, out of 95 (just to avoid anything special near start or near shutdown).

It looks like some significant things are:

 JS:
   * 1,048,576 from AllocateArena in jsgc.cpp
   * 151,248 from js_malloc
   * 27,072 from js_NativeGetInline->XPCConvert->moz_malloc

 HTTP:
   * 76,000 from HttpChannelParent::RecvAsyncOpen->nsStringBuffer::Realloc
   * Later some more in nsHttpChannel::OnStartRequest etc.


Definitely looks like we have a slow gradual increase in memory use. I estimate 8K per page load. (This doesn't take into account memory fragmentation, which would make things worse.) Of course some increase might be expected, due to session history storing URLs and such.
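For context, a simplified sketch of the snapshot-diffing idea (the real massdiff.py linked above parses ms_print output and diffs the full allocation trees; this toy version, with an illustrative file name in the usage comment, only compares the mem_heap_B totals massif records for two snapshots in a raw output file):

import sys

def snapshot_heap_totals(path):
    # Parse a massif.out file into {snapshot_number: total_heap_bytes},
    # using the "snapshot=" and "mem_heap_B=" lines that massif writes.
    totals = {}
    current = None
    with open(path) as f:
        for line in f:
            if line.startswith("snapshot="):
                current = int(line.split("=", 1)[1])
            elif line.startswith("mem_heap_B=") and current is not None:
                totals[current] = int(line.split("=", 1)[1])
    return totals

if __name__ == "__main__":
    # usage: python heap_diff.py massif.out.<pid> 33 66
    path, a, b = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
    totals = snapshot_heap_totals(path)
    print("snapshot %d: %d B, snapshot %d: %d B, delta: %+d B"
          % (a, totals[a], b, totals[b], totals[b] - totals[a]))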
Summary: Fennec chrome process memory can grow up to ~100 MB while loading pages. → Slow gradual memory increase in Fennec
Math error in last bit - the estimate should be 10.8K per page load.
njn, if you have time I would appreciate your thoughts on both the massdiff script (diff for massif snapshots) linked to in comment #20, and the data posted in that comment (specifically the JS memory increase).
Hmm, the 1MB JS increase is not gradual at all. It appears at that size in snapshot 50 (almost exactly in the middle of the run), and stays at exactly that size.

Here is the full stack. Looks like it's coming from an async messagemanager message.

 04.95% (1,048,576B) 0x5DC7A05: js::gc::Arena<JSObject_Slots8>* AllocateArena<JSObject_Slots8>(JSContext*, unsigned int) (jsgc.cpp:479)
  04.95% (1,048,576B) 0x5DCBEE3: bool RefillTypedFreeList<JSObject_Slots8>(JSContext*, unsigned int) (jsgc.cpp:1160)
   04.95% (1,048,576B) 0x5DBFBE6: RefillFinalizableFreeList(JSContext*, unsigned int) (jsgc.cpp:1185)
    04.95% (1,048,576B) 0x5D4B5BE: JSObject* NewFinalizableGCThing<JSObject>(JSContext*, unsigned int) (jsgcinlines.h:127)
     04.95% (1,048,576B) 0x5D48BDD: js_NewGCObject(JSContext*, js::gc::FinalizeKind) (jsgcinlines.h:139)
      04.95% (1,048,576B) 0x5D5D5EA: JSObject* js::detail::NewObject<false, false>(JSContext*, js::Class*, JSObject*, JSObject*, js::gc::FinalizeKind) (jsobjinlines.h:1081)
       04.95% (1,048,576B) 0x5D5D43D: JSObject* js::NewArray<false>(JSContext*, unsigned int, JSObject*) (jsarray.cpp:3061)
        04.95% (1,048,576B) 0x5D5C61D: js::NewDenseEmptyArray(JSContext*, JSObject*) (jsarray.cpp:3076)
         04.95% (1,048,576B) 0x5E1000F: js_BeginJSONParse(JSContext*, js::Value*, bool) (json.cpp:688)
          04.95% (1,048,576B) 0x5D41D96: JS_BeginJSONParse (jsapi.cpp:5641)
           04.95% (1,048,576B) 0x4ADC8DD: nsFrameMessageManager::ReceiveMessage(nsISupports*, nsAString_internal const&, int, nsAString_internal const&, JSObject*, InfallibleTArray<nsString>*, JSContext*) (nsFrameMessageManager.cpp:387)
            04.95% (1,048,576B) 0x4ADD1ED: nsFrameMessageManager::ReceiveMessage(nsISupports*, nsAString_internal const&, int, nsAString_internal const&, JSObject*, InfallibleTArray<nsString>*, JSContext*) (nsFrameMessageManager.cpp:477)
             04.95% (1,048,576B) 0x58C39C6: mozilla::dom::TabParent::ReceiveMessage(nsString const&, int, nsString const&, InfallibleTArray<nsString>*) (TabParent.cpp:607)
              04.95% (1,048,576B) 0x58C2CC5: mozilla::dom::TabParent::RecvAsyncMessage(nsString const&, nsString const&) (TabParent.cpp:309)
               04.95% (1,048,576B) 0x593B9B8: mozilla::dom::PBrowserParent::OnMessageReceived(IPC::Message const&) (PBrowserParent.cpp:569)
                04.95% (1,048,576B) 0x5948A9E: mozilla::dom::PContentParent::OnMessageReceived(IPC::Message const&) (PContentParent.cpp:629)
                 04.95% (1,048,576B) 0x58F2295: mozilla::ipc::AsyncChannel::OnDispatchMessage(IPC::Message const&) (AsyncChannel.cpp:262)
                  04.95% (1,048,576B) 0x58FA8EC: mozilla::ipc::RPCChannel::OnMaybeDequeueOne() (RPCChannel.cpp:433)
                   04.95% (1,048,576B) 0x5900369: void DispatchToMethod<mozilla::ipc::RPCChannel, bool (mozilla::ipc::RPCChannel::*)()>(mozilla::ipc::RPCChannel*, bool (mozilla::ipc::RPCChannel::*)(), Tuple0 const&) (tuple.h:383)
                    04.95% (1,048,576B) 0x59002F5: RunnableMethod<mozilla::ipc::RPCChannel, bool (mozilla::ipc::RPCChannel::*)(), Tuple0>::Run() (task.h:307)
                     04.95% (1,048,576B) 0x58FC0BE: mozilla::ipc::RPCChannel::RefCountedTask::Run()
(In reply to comment #23)
> Hmm, the 1MB JS increase is not gradual at all. It appears at that size in
> snapshot 50 (almost exactly in the middle of the run), and stays at exactly
> that size.
> 
> Here is the full stack. Looks like it's coming from an async messagemanager
> message.
> 
>  04.95% (1,048,576B) 0x5DC7A05: js::gc::Arena<JSObject_Slots8>*
> AllocateArena<JSObject_Slots8>(JSContext*, unsigned int) (jsgc.cpp:479)

Welcome to the world of interpreting Massif's output when it's run on a big complex program like Firefox :/

Firefox has lots of custom allocators that are built on top of malloc.  This is one of them.  What happens is that Firefox makes lots of small requests from that custom allocator, and the custom allocator occasionally allocates a large block (a.k.a. a superblock) and then hands it out in small pieces on subsequent requests.  That's why you see a big jump.  Unfortunately, this behaviour obscures what's happening, because you only see one big allocation rather than lots of little allocations.  And the little allocations may be coming from more than one place.
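To make the superblock behaviour concrete, here is a toy sketch in Python (illustrative only, not Firefox's or jemalloc's actual allocator): thousands of small requests are carved out of occasional 1 MB blocks, so a heap profiler only ever records the stack that happened to trigger each 1 MB allocation.

SUPERBLOCK_SIZE = 1024 * 1024  # 1 MB, like the jsgc.cpp arena allocation above

class ToyArena:
    # Hands out small pieces of occasionally-allocated 1 MB superblocks.
    def __init__(self):
        self.superblocks = []        # each entry models one real allocation
        self.used = SUPERBLOCK_SIZE  # force a superblock on the first request

    def alloc(self, size):
        if self.used + size > SUPERBLOCK_SIZE:
            # This is the only point a heap profiler sees: one big allocation,
            # attributed to whichever small request happened to trigger it.
            self.superblocks.append(bytearray(SUPERBLOCK_SIZE))
            self.used = 0
        offset = self.used
        self.used += size
        return (self.superblocks[-1], offset)  # handle to the carved-out piece

arena = ToyArena()
for _ in range(10000):
    arena.alloc(64)  # 10,000 tiny requests...
print("superblocks allocated:", len(arena.superblocks))  # ...show up as one big block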

You should also try massif-visualizer: https://projects.kde.org/projects/extragear/sdk/massif-visualizer.  It's a GUI for Massif files.  It's especially useful if you're trying to analyze the graph, because it's much more precise than the ASCII art version.

You might also like to try Massif's --alloc-fn option.  You can use it to ignore malloc wrappers (e.g. huge_malloc), which can make the output easier to read.

And then there's Massif's --pages-as-heap=yes option, which lets you profile at the mmap level instead of the heap level.  It's more comprehensive (covers all allocated memory) but is even harder to interpret.

As for the massdiff.py script... I don't know Python, but it looks pretty reasonable.  Showing the old, new and diff values is interesting; I hadn't thought of doing that.  I'm surprised you chose to parse the output of ms_print rather than Massif's output files.  I would probably do it by writing a script that diffs Massif's output files and produces a diff that can be handled normally by ms_print.  That's how I did things with Cachegrind and cg_diff.
BTW, bug 632234 is a tracking bug for this sort of thing; it's possible that one of the pending fixes for another bug will help this one.
Blocks: mlk2.0
Thanks njn, that is very helpful.
Depends on: 641663
I did a lot more profiling here with massif/massdiff, and there don't appear to be any remaining problems. So, closing this.

We do, however, gradually increase in size over time, due to JS garbage piling up, caches filling, etc. But these things are not leaks, and the increase eventually peaks. We might tweak the prefs for such things, though (like bug 641663 did).
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
No longer blocks: mlk2.0
No longer blocks: mlk-fx5+