Closed Bug 850803 Opened 11 years ago Closed 11 years ago

High memory usage (maybe orphaned DOM nodes) in clock app after creating many alarms

Categories: Firefox OS Graveyard :: Gaia::Clock (defect)
Platform: x86_64 Linux
Severity: major (priority not set)

Tracking: blocking-b2g: -
Status: RESOLVED FIXED

People: Reporter: rwood; Assignee: Unassigned

Whiteboard: [MemShrink:P3]

Attachments: 6 files

Attached file about memory dumps
I have developed a suite of long-running gaia-ui stress tests that use Marionette and run on the Unagi device. Some of these tests slow down after several iterations; the device UI gradually slows to a complete crawl and eventually causes Marionette to hit a socket timeout.

For example, the set_alarm test, which just creates a new clock alarm, verifies it, and repeats:

https://github.com/rwood-moz/gaia-ui-tests/blob/gaiastress/gaiatest/tests/stress/test_stress_set_alarm.py

In this particular example, it took 74 minutes just to run 150 iterations of the test. The test was run on Unagi with a mar12 eng build from pvt builds (build id: 20130312070202).
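For readers who don't want to dig through the repo, the shape of the loop is roughly as follows. This is a simplified sketch only: the element IDs for the new-alarm and done buttons are hypothetical, and the real test uses the gaiatest page objects rather than raw selectors (the '#alarms li' locator is the one the test actually uses).

    # Simplified sketch of the stress loop (hypothetical button IDs; the real
    # test is gaiatest/tests/stress/test_stress_set_alarm.py).
    import time

    from gaiatest import GaiaTestCase  # assumed harness base class


    class TestStressSetAlarm(GaiaTestCase):

        def test_stress_set_alarm(self):
            for iteration in range(150):
                # Create a new alarm with the default settings and save it.
                self.marionette.find_element('id', 'alarm-new').tap()   # hypothetical id
                self.marionette.find_element('id', 'alarm-done').tap()  # hypothetical id
                # Verify the alarm shows up in the alarm list before looping again.
                alarms = self.marionette.find_elements('css selector', '#alarms li')
                assert len(alarms) == iteration + 1
                time.sleep(1)  # brief settle time between iterations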

Attached are two about_memory reports (1: from the start of the test, before any iterations, and 2: after 140 iterations, when the device is 'crawling').

If you wish to run the actual stress test yourself just email me and I will send you the code location etc. Thanks!
This is blocking gaia-ui endurance testing, as I am not able to run certain tests for the full number of iterations. For example, I wanted to run 5000 iterations of the SMS conversation test; however, it took over 12 hours just to do the first 1000 iterations as the device slowed drastically, so I cancelled the test after 1000.
Severity: normal → major
There appears to be a memory leak in the clock app.

> Clock (pid 716)
>
> Explicit Allocations
> 55.25 MB (100.0%) -- explicit
> ├──31.82 MB (57.59%) ── heap-unclassified

The next step here would be to run DMD to find out what's going on.
blocking-b2g: --- → tef?
Attached file dmd and stack trace
Attached is an about_memory report with DMD enabled and a b2g stack trace, taken when the device clock app is in a very slowed state (after 171 set-alarm iterations).
Thanks!

> Clock (pid 787)
>
> 72.71 MB (100.0%) -- explicit
> ├──46.51 MB (63.97%) ── heap-unclassified

Ouch.

Most of this comes from a stack which looks like

> Unreported: ~2,840 blocks in stack trace record 1 of 335
>  ~11,624,120 bytes (~11,624,120 requested / ~0 slop)
>  17.38% of the heap (17.38% cumulative);  23.11% of unreported (23.11% cumulative)
>  Allocated at
>    malloc /home/rwood-gaia/B2G/gecko/memory/build/replace_malloc.c:152 (0x401612ae libmozglue.so+0x42ae)
>    moz_xmalloc /home/rwood-gaia/B2G/gecko/memory/mozalloc/mozalloc.cpp:55 (0x41076076 libxul.so+0xefd076)
>    NS_NewTextNode(nsIContent**, nsNodeInfoManager*) /home/rwood-gaia/B2G/gecko/content/base/src/nsTextNode.cpp:103 (0x40e33020 libxul.so+0x4e9020)
>    nsCOMPtr<nsIContent>::get() const /home/rwood-gaia/B2G/objdir-gecko/dist/include/nsCOMPtr.h:762 (0x40ff4876 libxul.so+0x6aa876)
>    nsHtml5TreeOperation::Perform(nsHtml5TreeOpExecutor*, nsIContent**) /home/rwood-gaia/B2G/gecko/parser/html/nsHtml5TreeOperation.cpp:510 (0x40ff581e libxul.so+0x6ab81e)
>    nsHtml5TreeOpExecutor::FlushDocumentWrite() /home/rwood-gaia/B2G/gecko/parser/html/nsHtml5TreeOpExecutor.cpp:657 (0x40828d5a libxul.so+0x6acd5a)
>    nsHtml5StringParser::Tokenize(nsAString_internal const&, nsIDocument*, bool) /home/rwood-gaia/B2G/gecko/parser/html/nsHtml5StringParser.cpp:130 (0x4082c782 libxul.so+0x6b0782)
>    nsHtml5StringParser::ParseFragment(nsAString_internal const&, nsIContent*, nsIAtom*, int, bool, bool) /home/rwood-gaia/B2G/gecko/parser/html/nsHtml5StringParser.cpp:66 (0x4082c85a libxul.so+0x6b085a)
>    ~AutoRestore /home/rwood-gaia/B2G/objdir-gecko/dist/include/mozilla/AutoRestore.h:40 (0x406246f4 libxul.so+0x4a86f4)
>    mozilla::ErrorResult::operator=(tag_nsresult) /home/rwood-gaia/B2G/objdir-gecko/dist/include/mozilla/ErrorResult.h:38 (0x40e931fe libxul.so+0x5491fe)
>    mozilla::ErrorResult::ErrorCode() const /home/rwood-gaia/B2G/objdir-gecko/dist/include/mozilla/ErrorResult.h:46 (0x411232fe libxul.so+0x7d92fe)
>    CallJSPropertyOpSetter /home/rwood-gaia/B2G/gecko/js/src/jscntxtinlines.h:450 (0x4164f204 libxul.so+0xd05204)
>    SetPropertyOperation /home/rwood-gaia/B2G/gecko/js/src/jsinterpinlines.h:360 (0x40e63b46 libxul.so+0xceab46)
If you don't mind, let's morph this bug into "clock app has high memory usage".  We can fix that and hopefully the slowness will go away.  If it doesn't, we can file a new bug.
Summary: Device UI slows to a crawl after several iterations of the set alarm gaia-ui stress test → High memory usage in clock app after creating many alarms
Whiteboard: [MemShrink]
The dark matter (i.e., heap-unclassified) may be bug 826526.  Hooray for old branches.

In any case, this appears to be a bug in the Gaia app.  It appears to be leaking memory.  Maybe DOM nodes with large amounts of text in them, or something like that.
Summary: High memory usage in clock app after creating many alarms → High memory usage (maybe orphaned DOM nodes) in clock app after creating many alarms
Etienne, who should own this?
Flags: needinfo?(etienne)
I think Ian knows the clock app the best.

Ian, if you need help finding out what exactly is leaking, maybe Gabriele can have a quick look at the trace.
Flags: needinfo?(etienne)
Sure, I can try to reproduce and see if I can tell something from the stacks.
What stacks?  I don't think you can tell anything from the DMD stacks other than what we've already gleaned.
(In reply to Justin Lebar [:jlebar] from comment #11)
> What stacks?  I don't think you can tell anything from the DMD stacks other
> than what we've already gleaned.

I was thinking of breaking in with gdb at that point and getting a JS stack-trace to see if it can shed some light on where this is coming from.
Ah, the JS stacks will probably help!
Removing the darkmatter dependency in the hope that this is just bug 826526.
No longer blocks: DarkMatter
So I've tried setting a breakpoint in nsHtml5StringParser::ParseFragment() and seeing what would come up when adding an alarm. I get five different JS traces from it:

0 aev_refreshRepeatMenu() ["app://clock.gaiamobile.org/js/alarm.js":924]
    this = [object Object]
1 aev_load() ["app://clock.gaiamobile.org/js/alarm.js":897]
    this = [object Object]
2 al_handleEvent() ["app://clock.gaiamobile.org/js/alarm.js":333]
    this = [object Object]
3 emitEvent() ["app://clock.gaiamobile.org/shared/js/mouse_event_shim.js":239]
    <failed to get 'this' value>
4 handleTouchEnd() ["app://clock.gaiamobile.org/shared/js/mouse_event_shim.js":125]
    this = [object Window]

0 aev_refreshSoundMenu() ["app://clock.gaiamobile.org/js/alarm.js":938]
    this = [object Object]
1 aev_load() ["app://clock.gaiamobile.org/js/alarm.js":899]
    this = [object Object]
2 al_handleEvent() ["app://clock.gaiamobile.org/js/alarm.js":333]
    this = [object Object]
3 emitEvent() ["app://clock.gaiamobile.org/shared/js/mouse_event_shim.js":239]
    <failed to get 'this' value>
4 handleTouchEnd() ["app://clock.gaiamobile.org/shared/js/mouse_event_shim.js":125]
    this = [object Window]

0 al_fillList() ["app://clock.gaiamobile.org/js/alarm.js":443]
    this = [object Object]
1 al_gotAlarmList() ["app://clock.gaiamobile.org/js/alarm.js":361]
    <failed to get 'this' value>
2 am_gotAlarmList() ["app://clock.gaiamobile.org/js/alarmManager.js":169]
    this = [object Window]
3 anonymous() ["app://clock.gaiamobile.org/js/alarmsdb.js":82]
    this = [object IDBRequest]

0 BV_setStatus() ["app://clock.gaiamobile.org/js/alarm.js":279]
    this = [object Object]
1 anonymous() ["app://clock.gaiamobile.org/js/alarmManager.js":117]
    this = [object DOMRequest]

0 al_refreshItem() ["app://clock.gaiamobile.org/js/alarm.js":400]
    this = [object Object]
1 am_putAlarm() ["app://clock.gaiamobile.org/js/alarmManager.js":111]
    this = [object Window]
2 anonymous() ["app://clock.gaiamobile.org/js/alarmsdb.js":60]
    this = [object IDBRequest]

Not knowing the app very well, I can't tell which one could be causing the leak (provided it's really one of these five spots). Can somebody familiar with the app double-check them to see if anything looks suspicious?
The clock app UI shows a list of all the alarms that have been set.  If you set hundreds of alarms, that list will have hundreds of elements in it.  So we expect some memory growth here.

It looks like we may be leaking these elements somehow.  I suspect we may be leaking memory because if these DOM nodes were in the DOM tree, they should have shown up in about:memory.  The fact that they're not (and that we believe we may be hitting bug 826526, or something like that) suggests these are probably DOM nodes held alive outside the DOM tree.
Ian/Gabriele, some things you could do to help here would be:

* Test with bug 826526 applied

* See whether passing --minimize to get_about_memory.py changes anything.

* See whether eliminating the innerHTML calls Gabriele identified in comment 15 changes anything (if so, that might point to a Gecko bug)

* Find a place where we create a long list of DOM nodes to leak.  :)  I skimmed alarm.js and didn't see anything, but maybe it's hiding somewhere.

I think it's pretty likely that the problem is in fillList, because that's the only function that uses innerHTML and also processes more data as we get more alarms.

HTH
> I think it's pretty likely that the problem is in fillList

I mean, I think it's pretty likely that the nodes we're leaking are in fillList.
In order to understand whether this should be tef+ or not: how many alarms need to be created for this to become an issue for the user, and what exactly is the issue for the user?
Flags: needinfo?(rwood)
(In reply to Justin Lebar [:jlebar] from comment #17)
> Ian/Gabriele, some things you could do to help here would be:
> 
> * Test with bug 826526 applied
> 
> * See whether passing --minimize to get_about_memory.py changes anything.

I'll give this a spin today, I'll also try to print out what's being added to the DOM tree (and what was there before) to see if there's something weird going on there.

> I think it's pretty likely that the problem is in fillList, because that's
> the only function that uses innerHTML and also processes more data as we get
> more alarms.

Yes, I think that's also the most likely to be causing a problem, and it might be a leak in Gecko that's just being triggered by the JS code. If taking it out makes the dark matter disappear, that would be a strong indication that it's the culprit, so I'll try this too in my testing.

The only other suspicious snippet is al_refreshItem() which does this:

    alarmItem.parentNode.innerHTML = content;

I'm not used to seeing this pattern (i.e. a node effectively changing itself by altering the contents of its parent), but maybe that's just because I'm not very experienced in writing JS code.
I'm a bit puzzled because I can't seem to reproduce the issue. I'm testing this on an Otoro with a build made from today's v1-train sources. I've taken 5 about:memory snapshots in the following situations:

- Newly launched clock application on a clean profile

- 50 alarms manually added

- Same as above but with minimization

- 100 alarms manually added

- Same as above but with minimization

The overall explicit size grew from 10.85 MiB in the first run to 12.75 MiB in the last one, with the preceding non-minimized snapshot peaking at 15.08 MiB. All of the size difference seems to be accounted for by the nodes added for the various alarms, and I can't see anything else suspicious in there. The application remained quite responsive the whole time.

I can attach the reports but before spamming the bug I wanted to be sure that I'm doing this correctly. I will try to reproduce this using the stress test script in order to see if it makes a difference or not.
I've tried to reproduce both with the script and manually. In a nutshell, creating 100 alarms manually doesn't seem to trigger the issue (though there's a small difference in the process, as I didn't change the label for each and every alarm while the automated test does); see [1]. The script, however, does trigger it: at the end of the run I'm left with a huge heap-unclassified and a large js-non-window, most of which is accounted for by decommitted-arenas [2]. Minimizing saves only a handful of KiBs on the manual test [3] but has a huge impact on the automated test, essentially bringing the process footprint down to what it should have been (the growth in heap-unclassified goes away entirely, and while js-non-window doesn't shrink as much, most of the difference is accounted for by an oversized decommitted-arenas) [4].

[1] attachment #731147 [details]

15.08 MB (100.0%) -- explicit
├───4.72 MB (31.30%) ++ js-non-window
├───4.05 MB (26.84%) ++ window-objects
├───2.16 MB (14.33%) ── heap-unclassified

[2] attachment #731149 [details]

32.41 MB (100.0%) -- explicit
├──17.25 MB (53.23%) ── heap-unclassified
├───7.32 MB (22.59%) ++ js-non-window
├───3.47 MB (10.71%) ++ window-objects

[3] attachment #731148 [details]

12.84 MB (100.0%) -- explicit
├───4.62 MB (35.97%) ++ js-non-window
├───2.35 MB (18.29%) ++ window-objects
├───1.92 MB (14.94%) ── heap-unclassified

[4] attachment #731150 [details]

├───6.38 MB (41.33%) ++ js-non-window
├───3.15 MB (20.38%) ++ window-objects
├───1.95 MB (12.63%) ── heap-unclassified

To sum it up, I'm a bit stumped. This doesn't look like a leak to me anymore, otherwise the minimization wouldn't clean it up; it looks more like the GC is not kicking in as frequently during the automated test as during manual interaction, or something along those lines.
> it looks more like the GC is not kicking in as frequently during the automated test as 
> during manual interaction

Yes, that must be exactly what's happening.
TEF- based on the last couple of comments; please nominate again if you disagree.
blocking-b2g: tef? → -
(In reply to Justin Lebar [:jlebar] from comment #27)
> > it looks more like the GC is not kicking in as frequently during the automated test as 
> > during manual interaction
> 
> Yes, that must be exactly what's happening.

Is this a 'bug' on the b2g side, or is there something we need to be doing on the automated testing side to enable/allow GC to occur more frequently during the tests? Thanks everyone for the feedback!
Flags: needinfo?(rwood)
> is there something we need to be doing on the automated testing side to enable/allow GC 
> to occur more frequently during the tests?

Try inserting some sleep()s.
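A minimal sketch of where such a pause would go, assuming the test body stays as it is:

    import time

    ITERATIONS = 150
    for i in range(ITERATIONS):
        # ... create and verify the alarm exactly as the test already does ...
        time.sleep(5)  # pause so GC/CC have a chance to run between iterations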

In general if the results of automated tests don't match the results of the corresponding manual tests, I think that's an error in the automated tests.
> In general if the results of automated tests don't match the results of the
> corresponding manual tests, I think that's an error in the automated tests.

I agree.  Good automated tests are harder to write than they first appear, because of issues like this.  For example, the tests used by areweslimyet.com required significant tweaking (including longer pauses between actions) before they gave realistic results.
Adding sleeps did not fix the issue (even with a huge 60-second sleep after each alarm was set); however, I tracked the problem down to this call in the test:

new_alarms_count = len(self.marionette.find_elements(*clock_object._all_alarms))

(Note the locator is: *clock_object._all_alarms = ('css selector', '#alarms li'))

:mdas, could there possibly be a memory leak in marionette.find_elements? With the above code in place, the test/device slows down after several iterations, and eventually, after about 150 iterations/alarms set, it takes over 2 minutes to get the alarm count (you can actually see the device clock app's second hand stop until the next iteration starts).
Flags: needinfo?(mdas)
Whiteboard: [MemShrink] → [MemShrink:P3]
Hmm, well findElements causes our element manager to cache any found elements, but we clear this cache every time the marionette session is deleted (we call reset(), which does this: http://mxr.mozilla.org/mozilla-central/source/testing/marionette/marionette-elements.js#61). The session gets deleted automatically during a test's tearDown (it's in the CommonTestCase class, which is inherited by all *TestCase classes). So all references to any found elements get removed at the end of each test run. This is the only place I can think of where you can get some memory being eaten up if the GC doesn't run.

If this is the case, we can address https://bugzilla.mozilla.org/show_bug.cgi?id=736592 now, which has been on the backburner. It replaces our caching with weak references.
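For what it's worth, the weak-reference idea can be sketched in a few lines. This is only a concept illustration in Python; the actual change would be to marionette-elements.js and is tracked in bug 736592:

    import weakref

    class ElementCache:
        # Concept sketch: map element IDs to elements without keeping them alive.
        def __init__(self):
            # Values are held weakly, so a cached element no longer pins the
            # underlying object once nothing else references it.
            self._elements = weakref.WeakValueDictionary()

        def add(self, element_id, element):
            self._elements[element_id] = element

        def get(self, element_id):
            return self._elements.get(element_id)

With a cache like this, entries stop keeping elements alive on their own, so they can be collected as soon as the test itself lets go of them.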
Flags: needinfo?(mdas)
> So all references to any found elements get removed at the end of each test run.

That behavior seems consistent with what's being described here.
jgriffin has suggested running the gc at test teardown, which should help us in the meantime, and for any other memory heavy stress test where the GC doesn't run often enough. We'll try this approach and see how it goes.
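A rough sketch of what that could look like from the Python side, assuming the Marionette client's chrome-context script execution (the exact hook used by the harness may differ):

    # Hypothetical teardown helper: force a GC/CC from chrome context after each test.
    def force_gc(marionette):
        marionette.set_context('chrome')
        marionette.execute_script("""
            Components.utils.forceGC();
            Components.utils.forceCC();
        """)
        marionette.set_context('content')

Calling something like this from the test's tearDown() would approximate the manual-use pattern, where the app gets idle time for GC between interactions.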
(In reply to Malini Das [:mdas] from comment #35)
> jgriffin has suggested running the gc at test teardown, which should help us
> in the meantime, and for any other memory heavy stress test where the GC
> doesn't run often enough. We'll try this approach and see how it goes.

jlebar pointed out that we don't delete the marionette session between runs (which is what would mark the element cache for GC), and in any case, replacing our cache with weak references is a stronger solution.
Wondering if anyone has updated information on this issue?
Flags: needinfo?
Yes, this issue was resolved with the fix as noted in comment 33. Thanks for the follow-up!
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?
Resolution: --- → FIXED