Closed Bug 850803 Opened 11 years ago Closed 11 years ago

High memory usage (maybe orphaned DOM nodes) in clock app after creating many alarms

Categories: Firefox OS Graveyard :: Gaia::Clock (defect)
Platform: x86_64 Linux
Severity: major (priority not set)

Tracking: blocking-b2g: -
Status: RESOLVED FIXED

People: Reporter: rwood; Assignee: Unassigned

Whiteboard: [MemShrink:P3]

Attachments: 6 files

Attached file about memory dumps
I have developed a suite of long-running gaia-ui stress tests that use Marionette and run on the Unagi device. Some of these tests slow down after several iterations; the device UI gradually slows to a complete crawl and eventually causes Marionette to hit a socket timeout.

For example, the set_alarm test, which just creates a new clock alarm, verifies it, and repeats:

https://github.com/rwood-moz/gaia-ui-tests/blob/gaiastress/gaiatest/tests/stress/test_stress_set_alarm.py

In this particular example, it took 74 minutes just to run 150 iterations of the test. The test was run on Unagi with a mar12 eng build from pvt builds (build id: 20130312070202).
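For readers who don't want to dig through the repo, the shape of the loop is roughly as follows. This is a simplified sketch only: the element IDs for the new-alarm and done buttons are hypothetical, and the real test uses the gaiatest page objects rather than raw selectors (the '#alarms li' locator is the one the test actually uses).

    # Simplified sketch of the stress loop (hypothetical button IDs; the real
    # test is gaiatest/tests/stress/test_stress_set_alarm.py).
    import time

    from gaiatest import GaiaTestCase  # assumed harness base class


    class TestStressSetAlarm(GaiaTestCase):

        def test_stress_set_alarm(self):
            for iteration in range(150):
                # Create a new alarm with the default settings and save it.
                self.marionette.find_element('id', 'alarm-new').tap()   # hypothetical id
                self.marionette.find_element('id', 'alarm-done').tap()  # hypothetical id
                # Verify the alarm shows up in the alarm list before looping again.
                alarms = self.marionette.find_elements('css selector', '#alarms li')
                assert len(alarms) == iteration + 1
                time.sleep(1)  # brief settle time between iterations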

Attached are two about_memory reports (1: from the start of the test, before any iterations, and 2: after 140 iterations, when the device is 'crawling').

If you wish to run the actual stress test yourself just email me and I will send you the code location etc. Thanks!
This is blocking gaia-ui endurance testing, as I am not able to run certain tests for the full number of iterations. For example, I wanted to run 5000 iterations of the SMS conversation test; however, it took over 12 hours just to do the first 1000 iterations as the device slowed drastically, so I cancelled the test after 1000.
Severity: normal → major
There appears to be a memory leak in the clock app.

> Clock (pid 716)
>
> Explicit Allocations
> 55.25 MB (100.0%) -- explicit
> ├──31.82 MB (57.59%) ── heap-unclassified

The next step here would be to run DMD to find out what's going on.
blocking-b2g: --- → tef?
Attached file dmd and stack trace
Attached is an about_memory report with DMD enabled and a b2g stack trace, taken when the device clock app is in a very slowed state (after 171 set-alarm iterations).
Thanks!

> Clock (pid 787)
>
> 72.71 MB (100.0%) -- explicit
> ├──46.51 MB (63.97%) ── heap-unclassified

Ouch.

Most of this comes from a stack which looks like

> Unreported: ~2,840 blocks in stack trace record 1 of 335
>  ~11,624,120 bytes (~11,624,120 requested / ~0 slop)
>  17.38% of the heap (17.38% cumulative);  23.11% of unreported (23.11% cumulative)
>  Allocated at
>    malloc /home/rwood-gaia/B2G/gecko/memory/build/replace_malloc.c:152 (0x401612ae libmozglue.so+0x42ae)
>    moz_xmalloc /home/rwood-gaia/B2G/gecko/memory/mozalloc/mozalloc.cpp:55 (0x41076076 libxul.so+0xefd076)
>    NS_NewTextNode(nsIContent**, nsNodeInfoManager*) /home/rwood-gaia/B2G/gecko/content/base/src/nsTextNode.cpp:103 (0x40e33020 libxul.so+0x4e9020)
>    nsCOMPtr<nsIContent>::get() const /home/rwood-gaia/B2G/objdir-gecko/dist/include/nsCOMPtr.h:762 (0x40ff4876 libxul.so+0x6aa876)
>    nsHtml5TreeOperation::Perform(nsHtml5TreeOpExecutor*, nsIContent**) /home/rwood-gaia/B2G/gecko/parser/html/nsHtml5TreeOperation.cpp:510 (0x40ff581e libxul.so+0x6ab81e)
>    nsHtml5TreeOpExecutor::FlushDocumentWrite() /home/rwood-gaia/B2G/gecko/parser/html/nsHtml5TreeOpExecutor.cpp:657 (0x40828d5a libxul.so+0x6acd5a)
>    nsHtml5StringParser::Tokenize(nsAString_internal const&, nsIDocument*, bool) /home/rwood-gaia/B2G/gecko/parser/html/nsHtml5StringParser.cpp:130 (0x4082c782 libxul.so+0x6b0782)
>    nsHtml5StringParser::ParseFragment(nsAString_internal const&, nsIContent*, nsIAtom*, int, bool, bool) /home/rwood-gaia/B2G/gecko/parser/html/nsHtml5StringParser.cpp:66 (0x4082c85a libxul.so+0x6b085a)
>    ~AutoRestore /home/rwood-gaia/B2G/objdir-gecko/dist/include/mozilla/AutoRestore.h:40 (0x406246f4 libxul.so+0x4a86f4)
>    mozilla::ErrorResult::operator=(tag_nsresult) /home/rwood-gaia/B2G/objdir-gecko/dist/include/mozilla/ErrorResult.h:38 (0x40e931fe libxul.so+0x5491fe)
>    mozilla::ErrorResult::ErrorCode() const /home/rwood-gaia/B2G/objdir-gecko/dist/include/mozilla/ErrorResult.h:46 (0x411232fe libxul.so+0x7d92fe)
>    CallJSPropertyOpSetter /home/rwood-gaia/B2G/gecko/js/src/jscntxtinlines.h:450 (0x4164f204 libxul.so+0xd05204)
>    SetPropertyOperation /home/rwood-gaia/B2G/gecko/js/src/jsinterpinlines.h:360 (0x40e63b46 libxul.so+0xceab46)
If you don't mind, let's morph this bug into "clock app has high memory usage".  We can fix that and hopefully the slowness will go away.  If it doesn't, we can file a new bug.
Summary: Device UI slows to a crawl after several iterations of the set alarm gaia-ui stress test → High memory usage in clock app after creating many alarms
Whiteboard: [MemShrink]
The dark matter (i.e., heap-unclassified) may be bug 826526.  Hooray for old branches.

In any case, this appears to be a bug in the Gaia app.  It appears to be leaking memory.  Maybe DOM nodes with large amounts of text in them, or something like that.
Summary: High memory usage in clock app after creating many alarms → High memory usage (maybe orphaned DOM nodes) in clock app after creating many alarms
Etienne, who should own this?
Flags: needinfo?(etienne)
I think Ian knows the clock app the best.

Ian, if you need help finding out what exactly is leaking, maybe Gabriele can have a quick look at the trace.
Flags: needinfo?(etienne)
Sure, I can try to reproduce and see if I can tell something from the stacks.
What stacks?  I don't think you can tell anything from the DMD stacks other than what we've already gleaned.
(In reply to Justin Lebar [:jlebar] from comment #11)
> What stacks?  I don't think you can tell anything from the DMD stacks other
> than what we've already gleaned.

I was thinking of breaking in with gdb at that point and getting a JS stack-trace to see if it can shed some light on where this is coming from.
Ah, the JS stacks will probably help!
Removing the darkmatter dependency in the hope that this is just bug 826526.
No longer blocks: DarkMatter
So I've tried setting a breakpoint in nsHtml5StringParser::ParseFragment() and seeing what would come up when adding an alarm. I get five different JS traces from it:

0 aev_refreshRepeatMenu() ["app://clock.gaiamobile.org/js/alarm.js":924]
    this = [object Object]
1 aev_load() ["app://clock.gaiamobile.org/js/alarm.js":897]
    this = [object Object]
2 al_handleEvent() ["app://clock.gaiamobile.org/js/alarm.js":333]
    this = [object Object]
3 emitEvent() ["app://clock.gaiamobile.org/shared/js/mouse_event_shim.js":239]
    <failed to get 'this' value>
4 handleTouchEnd() ["app://clock.gaiamobile.org/shared/js/mouse_event_shim.js":125]
    this = [object Window]

0 aev_refreshSoundMenu() ["app://clock.gaiamobile.org/js/alarm.js":938]
    this = [object Object]
1 aev_load() ["app://clock.gaiamobile.org/js/alarm.js":899]
    this = [object Object]
2 al_handleEvent() ["app://clock.gaiamobile.org/js/alarm.js":333]
    this = [object Object]
3 emitEvent() ["app://clock.gaiamobile.org/shared/js/mouse_event_shim.js":239]
    <failed to get 'this' value>
4 handleTouchEnd() ["app://clock.gaiamobile.org/shared/js/mouse_event_shim.js":125]
    this = [object Window]

0 al_fillList() ["app://clock.gaiamobile.org/js/alarm.js":443]
    this = [object Object]
1 al_gotAlarmList() ["app://clock.gaiamobile.org/js/alarm.js":361]
    <failed to get 'this' value>
2 am_gotAlarmList() ["app://clock.gaiamobile.org/js/alarmManager.js":169]
    this = [object Window]
3 anonymous() ["app://clock.gaiamobile.org/js/alarmsdb.js":82]
    this = [object IDBRequest]

0 BV_setStatus() ["app://clock.gaiamobile.org/js/alarm.js":279]
    this = [object Object]
1 anonymous() ["app://clock.gaiamobile.org/js/alarmManager.js":117]
    this = [object DOMRequest]

0 al_refreshItem() ["app://clock.gaiamobile.org/js/alarm.js":400]
    this = [object Object]
1 am_putAlarm() ["app://clock.gaiamobile.org/js/alarmManager.js":111]
    this = [object Window]
2 anonymous() ["app://clock.gaiamobile.org/js/alarmsdb.js":60]
    this = [object IDBRequest]

Not knowing the app very well, I can't tell which one could be causing the leak (provided it's really one of these five spots). Can somebody familiar with the app double-check them to see if anything looks suspicious?
The clock app UI shows a list of all the alarms that have been set.  If you set hundreds of alarms, that list will have hundreds of elements in it.  So we expect some memory growth here.

It looks like we may be leaking these elements somehow.  I suspect we may be leaking memory because if these DOM nodes were in the DOM tree, they should have shown up in about:memory.  The fact that they're not (and that we believe we may be hitting bug 826526, or something like that) suggests these are probably DOM nodes held alive outside the DOM tree.
Ian/Gabriele, some things you could do to help here would be:

* Test with bug 826526 applied

* See whether passing --minimize to get_about_memory.py changes anything.

* See whether eliminating the innerHTML calls Gabriele identified in comment 15 changes anything (if so, that might point to a Gecko bug)

* Find a place where we create a long list of DOM nodes to leak.  :)  I skimmed alarm.js and didn't see anything, but maybe it's hiding somewhere.

I think it's pretty likely that the problem is in fillList, because that's the only function that uses innerHTML and also processes more data as we get more alarms.

HTH
> I think it's pretty likely that the problem is in fillList

I mean, I think it's pretty likely that the nodes we're leaking are in fillList.
In order to understand whether this should be tef+ or not: how many alarms need to be created for this to become an issue for the user, and what exactly is the issue for the user?
Flags: needinfo?(rwood)
(In reply to Justin Lebar [:jlebar] from comment #17)
> Ian/Gabriele, some things you could do to help here would be:
> 
> * Test with bug 826526 applied
> 
> * See whether passing --minimize to get_about_memory.py changes anything.

I'll give this a spin today, I'll also try to print out what's being added to the DOM tree (and what was there before) to see if there's something weird going on there.

> I think it's pretty likely that the problem is in fillList, because that's
> the only function that uses innerHTML and also processes more data as we get
> more alarms.

Yes, I think that's also the most likely to be causing a problem, and it might be a leak in Gecko that's just being triggered by the JS code. If taking it out makes the dark matter disappear, that would be a strong indication that it's the culprit, so I'll try this too in my testing.

The only other suspicious snippet is al_refreshItem() which does this:

    alarmItem.parentNode.innerHTML = content;

I'm not used to seeing this pattern (i.e. a node effectively changing itself by altering the contents of its parent), but maybe that's just because I'm not very experienced in writing JS code.
I'm a bit puzzled because I can't seem to reproduce the issue. I'm testing this on an Otoro with a build made from today's v1-train sources. I've taken 5 about:memory snapshots in the following situations:

- Newly launched clock application on a clean profile

- 50 alarms manually added

- Same as above but with minimization

- 100 alarms manually added

- Same as above but with minimization

The overall explicit size grew from 10.85 MiB in the first run to 12.75 MiB in the last one, with the preceding non-minimized snapshot peaking at 15.08 MiB. All of the size difference seems to be accounted for by the nodes added for the various alarms, and I can't see anything else suspicious in there. The application remained quite responsive the whole time.

I can attach the reports but before spamming the bug I wanted to be sure that I'm doing this correctly. I will try to reproduce this using the stress test script in order to see if it makes a difference or not.
I've tried to reproduce both with the script and manually. In a nutshell, creating 100 alarms manually doesn't seem to trigger the issue (though there's a small difference in the process, as I didn't change the label for each and every alarm while the automated test does); see [1]. The script, however, does trigger it: at the end of the run I'm left with a huge heap-unclassified and a large js-non-window, most of which is accounted for by decommitted-arenas [2]. Minimizing saves only a handful of KiBs on the manual test [3] but has a huge impact on the automated test, essentially bringing the process footprint down to what it should have been (the growth in heap-unclassified goes away entirely, and while js-non-window doesn't shrink as much, most of the difference is accounted for by an oversized decommitted-arenas) [4].

[1] attachment #731147 [details]

15.08 MB (100.0%) -- explicit
├───4.72 MB (31.30%) ++ js-non-window
├───4.05 MB (26.84%) ++ window-objects
├───2.16 MB (14.33%) ── heap-unclassified

[2] attachment #731149 [details]

32.41 MB (100.0%) -- explicit
├──17.25 MB (53.23%) ── heap-unclassified
├───7.32 MB (22.59%) ++ js-non-window
├───3.47 MB (10.71%) ++ window-objects

[3] attachment #731148 [details]

12.84 MB (100.0%) -- explicit
├───4.62 MB (35.97%) ++ js-non-window
├───2.35 MB (18.29%) ++ window-objects
├───1.92 MB (14.94%) ── heap-unclassified

[4] attachment #731150 [details]

├───6.38 MB (41.33%) ++ js-non-window
├───3.15 MB (20.38%) ++ window-objects
├───1.95 MB (12.63%) ── heap-unclassified

To sum it up, I'm a bit stumped. This doesn't look like a leak to me anymore, otherwise the minimization wouldn't clean it up; it looks more like the GC is not kicking in as frequently during the automated test as during manual interaction, or something along those lines.
> it looks more like the GC is not kicking in as frequently during the automated test as 
> during manual interaction

Yes, that must be exactly what's happening.
TEF- based on the last couple of comments; please nominate again if you disagree.
blocking-b2g: tef? → -
(In reply to Justin Lebar [:jlebar] from comment #27)
> > it looks more like the GC is not kicking in as frequently during the automated test as 
> > during manual interaction
> 
> Yes, that must be exactly what's happening.

Is this a 'bug' on the b2g side, or is there something we need to be doing on the automated testing side to enable/allow GC to occur more frequently during the tests? Thanks everyone for the feedback!
Flags: needinfo?(rwood)
> is there something we need to be doing on the automated testing side to enable/allow GC 
> to occur more frequently during the tests?

Try inserting some sleep()s.
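A minimal sketch of where such a pause would go, assuming the test body stays as it is:

    import time

    ITERATIONS = 150
    for i in range(ITERATIONS):
        # ... create and verify the alarm exactly as the test already does ...
        time.sleep(5)  # pause so GC/CC have a chance to run between iterations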

In general if the results of automated tests don't match the results of the corresponding manual tests, I think that's an error in the automated tests.
> In general if the results of automated tests don't match the results of the
> corresponding manual tests, I think that's an error in the automated tests.

I agree.  Good automated tests are harder to write than they first appear, because of issues like this.  For example, the tests used by areweslimyet.com required significant tweaking (including longer pauses between actions) before they gave realistic results.
Adding sleeps did not fix the issue (even with a huge 60-second sleep after each alarm was set); however, I tracked the problem down to this call in the test:

new_alarms_count = len(self.marionette.find_elements(*clock_object._all_alarms))

(Note the locator is: *clock_object._all_alarms = ('css selector', '#alarms li'))

:mdas, could there possibly be a memory leak in marionette.find_elements? With the above code in place, the test/device slows down after several iterations, and eventually, after about 150 iterations/alarms set, it takes over 2 minutes to get the alarm count (you can actually see the device clock app's second hand stop until the next iteration starts).
Flags: needinfo?(mdas)
Whiteboard: [MemShrink] → [MemShrink:P3]
Hmm, well findElements causes our element manager to cache any found elements, but we clear this cache every time the marionette session is deleted (we call reset(), which does this: http://mxr.mozilla.org/mozilla-central/source/testing/marionette/marionette-elements.js#61). The session gets deleted automatically during a test's tearDown (it's in the CommonTestCase class, which is inherited by all *TestCase classes). So all references to any found elements get removed at the end of each test run. This is the only place I can think of where you can get some memory being eaten up if the GC doesn't run.

If this is the case, we can address https://bugzilla.mozilla.org/show_bug.cgi?id=736592 now, which has been on the backburner. It replaces our caching with weak references.
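For what it's worth, the weak-reference idea can be sketched in a few lines. This is only a concept illustration in Python; the actual change would be to marionette-elements.js and is tracked in bug 736592:

    import weakref

    class ElementCache:
        # Concept sketch: map element IDs to elements without keeping them alive.
        def __init__(self):
            # Values are held weakly, so a cached element no longer pins the
            # underlying object once nothing else references it.
            self._elements = weakref.WeakValueDictionary()

        def add(self, element_id, element):
            self._elements[element_id] = element

        def get(self, element_id):
            return self._elements.get(element_id)

With a cache like this, entries stop keeping elements alive on their own, so they can be collected as soon as the test itself lets go of them.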
Flags: needinfo?(mdas)
> So all references to any found elements get removed at the end of each test run.

That behavior seems consistent with what's being described here.
jgriffin has suggested running the gc at test teardown, which should help us in the meantime, and for any other memory heavy stress test where the GC doesn't run often enough. We'll try this approach and see how it goes.
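A rough sketch of what that could look like from the Python side, assuming the Marionette client's chrome-context script execution (the exact hook used by the harness may differ):

    # Hypothetical teardown helper: force a GC/CC from chrome context after each test.
    def force_gc(marionette):
        marionette.set_context('chrome')
        marionette.execute_script("""
            Components.utils.forceGC();
            Components.utils.forceCC();
        """)
        marionette.set_context('content')

Calling something like this from the test's tearDown() would approximate the manual-use pattern, where the app gets idle time for GC between interactions.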
(In reply to Malini Das [:mdas] from comment #35)
> jgriffin has suggested running the gc at test teardown, which should help us
> in the meantime, and for any other memory heavy stress test where the GC
> doesn't run often enough. We'll try this approach and see how it goes.

jlebar pointed out that we don't delete the marionette session between runs (which is what would mark the element cache for GC), and in any case, replacing our cache with weak references is a stronger solution.
Wondering if anyone has updated information on this issue?
Flags: needinfo?
Yes, this issue was resolved with the fix as noted in comment 33. Thanks for the follow-up!
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?
Resolution: --- → FIXED