Bug 841976 - Gaia apps randomly freeze during UI tests
Status: RESOLVED WORKSFORME
Whiteboard: [MemShrink:P1]
Product: Firefox OS
Classification: Client Software
Component: General
Version: unspecified
Platform: All All
Importance: -- normal
Target Milestone: ---
Assigned To: Nobody; OK to take it and work on it
Mentors:
Depends on: 841993 842550
Blocks:
Reported: 2013-02-15 17:33 PST by Jonathan Griffin (:jgriffin)
Modified: 2013-02-22 20:10 PST
CC List: 13 users
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
logcat (306.29 KB, text/plain)
2013-02-15 17:34 PST, Jonathan Griffin (:jgriffin)
about:memory report (1.13 MB, text/plain)
2013-02-15 17:34 PST, Jonathan Griffin (:jgriffin)
dmd-b2g-109.txt.gz (1.08 MB, application/x-gzip)
2013-02-15 17:35 PST, Jonathan Griffin (:jgriffin)
dmd-1357043675-2027.txt.gz (160.12 KB, application/x-gzip)
2013-02-15 17:36 PST, Jonathan Griffin (:jgriffin)
dmd-1357043675-937.txt.gz (103.97 KB, application/x-gzip)
2013-02-15 17:36 PST, Jonathan Griffin (:jgriffin)
dmd-1357043675-109.txt.gz (332.29 KB, application/x-gzip)
2013-02-15 17:36 PST, Jonathan Griffin (:jgriffin)
output of 'thread apply all bt' (58.86 KB, text/plain)
2013-02-15 17:38 PST, Jonathan Griffin (:jgriffin)
about:memory report when reuseGlobal=false (1.24 MB, text/plain)
2013-02-19 18:43 PST, Jonathan Griffin (:jgriffin)
DMD reports when reuseGlobal=false (1.05 MB, application/x-gzip)
2013-02-19 18:43 PST, Jonathan Griffin (:jgriffin)
output of 'adb shell dmesg' (129.56 KB, text/plain)
2013-02-19 18:45 PST, Jonathan Griffin (:jgriffin)

Description Jonathan Griffin (:jgriffin) 2013-02-15 17:33:27 PST
While running the Gaia UI tests, Gaia apps will randomly freeze for a few minutes, during which time the UI is (mostly) unresponsive.  This seems to be less frequent today than it was earlier in the week, but still occurs.  This completely breaks the tests when it happens.  This tends to happen towards the end of the test cycle after many apps have been exercised.

Per the discussion in bug 837187 I gathered a gdb thread dump and an about-memory report with MOZ_DMD enabled, and am attaching those (plus the logcat) here.

Note that the memory dump seemed to hang during "Processing DMD files.  This may take a minute or two."...I killed it after 20 minutes but am attaching the raw files.
Comment 1 Jonathan Griffin (:jgriffin) 2013-02-15 17:34:03 PST
Created attachment 714694 [details]
logcat
Comment 2 Jonathan Griffin (:jgriffin) 2013-02-15 17:34:40 PST
Created attachment 714695 [details]
about:memory report
Comment 3 Jonathan Griffin (:jgriffin) 2013-02-15 17:35:27 PST
Created attachment 714696 [details]
dmd-b2g-109.txt.gz
Comment 4 Jonathan Griffin (:jgriffin) 2013-02-15 17:36:00 PST
Created attachment 714697 [details]
dmd-1357043675-2027.txt.gz
Comment 5 Jonathan Griffin (:jgriffin) 2013-02-15 17:36:30 PST
Created attachment 714698 [details]
dmd-1357043675-937.txt.gz
Comment 6 Jonathan Griffin (:jgriffin) 2013-02-15 17:36:57 PST
Created attachment 714699 [details]
dmd-1357043675-109.txt.gz
Comment 7 Jonathan Griffin (:jgriffin) 2013-02-15 17:38:31 PST
Created attachment 714700 [details]
output of 'thread apply all bt'
Comment 8 Justin Lebar (not reading bugmail) 2013-02-15 19:01:22 PST
I don't see anything particularly wrong in the gdb output, but maybe cjones will.  What concerns me the most is that the main process has an RSS here of 87 MB, which is very bad.

I'm not sure that the memory problem is causing your hang, though.  I'd expect to see all of the apps on the system killed before we hang, but you still have the homescreen and a browser process alive.

It might be worth checking the output of adb shell dmesg next time this happens.
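As a rough illustration of what to look for in that output, here is a minimal Python sketch that filters a saved dmesg dump for lines suggesting a process was killed for memory.  The keyword patterns and the sample lines are assumptions made for illustration; they are not taken from the attached log, and the kernel on the device may word these messages differently.

```python
import re

# Assumed patterns: substrings that kernel OOM or low-memory-killer messages
# often contain.  Adjust these to whatever the device's kernel actually prints.
KILL_PATTERNS = re.compile(r"sigkill|out of memory|killed process", re.IGNORECASE)


def find_kill_lines(dmesg_text):
    """Return the dmesg lines that look like a process being killed for memory."""
    return [line for line in dmesg_text.splitlines() if KILL_PATTERNS.search(line)]


# Hypothetical sample input, invented for this sketch.
sample = """\
<6>[  100.000000] binder: release proc 512
<4>[  101.500000] send sigkill to 937 (Browser), adj 10, size 5120
<6>[  102.000000] wlan0: link up
"""

hits = find_kill_lines(sample)
for line in hits:
    print(line)
```

To use it on a real run, save the output of 'adb shell dmesg' to a file and pass the file's contents to find_kill_lines.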

On the upside, the main process does not have high heap-unclassified (and also the DMD report for the main process was fully processed), but on the downside, most of the memory falls into the one System Principal compartment, which is mostly opaque.  This is due to bug 798491.

It actually might be very interesting if you could reproduce this problem with bug 798491 disabled.  We'd get a very different memory report, I expect.  I think this is just a matter of flipping the "jsloader.reuseGlobal" pref to false in b2g.js.
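A minimal sketch of that pref flip, assuming b2g.js uses the usual pref("name", value); line syntax.  Whether the file already contains a jsloader.reuseGlobal line, and how it is formatted, is an assumption; the helper set_bool_pref is hypothetical and simply appends the pref if it finds no existing line.

```python
import re


def set_bool_pref(prefs_text, name, value):
    """Set a boolean pref in prefs-file text; append the line if it is absent."""
    literal = "true" if value else "false"
    # Matches e.g.:  pref("jsloader.reuseGlobal", true);
    pattern = re.compile(
        r'^(\s*pref\(\s*"%s"\s*,\s*)(true|false)(\s*\)\s*;)' % re.escape(name),
        re.MULTILINE,
    )
    new_text, count = pattern.subn(
        lambda m: m.group(1) + literal + m.group(3), prefs_text
    )
    if count == 0:
        new_text = prefs_text.rstrip("\n") + '\npref("%s", %s);\n' % (name, literal)
    return new_text


# Hypothetical file contents, invented for this sketch.
before = 'pref("foo.bar", 1);\npref("jsloader.reuseGlobal", true);\n'
after = set_bool_pref(before, "jsloader.reuseGlobal", False)
print(after)
```

Other prefs in the text are left untouched, so the result can be written back over the original file before rebuilding or pushing it to the device.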

FWIW the memory report here looks pretty different from the one in bug 837187, although there may be some overlap, because they're both using too much memory in the system compartment.
Comment 9 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-02-15 19:28:25 PST
OK, this guy is *very* suspicious:

Unreported: 61 blocks in stack trace record 4 of 819
 499,712 bytes (386,252 requested / 113,460 slop)
 1.16% of the heap (5.96% cumulative);  4.09% of unreported (20.93% cumulative)
 Allocated at
   replace_malloc /home/jgriffin/mozilla-inbound/src/memory/replace/dmd/DMD.cpp:1228 (0x4008b75e libdmd.so+0x375e)
   malloc /home/jgriffin/mozilla-inbound/src/memory/build/replace_malloc.c:152 (0x401ff2fa libmozglue.so+0x42fa)
   moz_xmalloc /home/jgriffin/mozilla-inbound/src/memory/mozalloc/mozalloc.cpp:55 (0x411bebfa libxul.so+0xfa7bfa)
   Channel /home/jgriffin/mozilla-inbound/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:838 (0x40d780f4 libxul.so+0xb610f4)
   mozilla::ipc::OpenDescriptor(mozilla::ipc::TransportDescriptor const&, IPC::Channel::Mode) /home/jgriffin/mozilla-inbound/src/ipc/glue/Transport_posix.cpp:56 (0x40b8eadc libxul.so+0x977adc)
   mozilla::dom::PContentParent::OnMessageReceived(IPC::Message const&) /home/jgriffin/unagi/objdir-gecko/ipc/ipdl/PContentParent.cpp:2371 (0x410be68a libxul.so+0x9a168a)

This is the ipc::Transport we allocate *per process* for the graphics pipeline.  Having 61 of them alive when there are 2 content processes is very worrying ...
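The numbers in that record can be sanity-checked with a little arithmetic.  The sketch below uses only the figures quoted above; the per-block sizes and the heap totals are derived values, and reading them as "61 equal-sized allocations" is an inference from the even division, not something the DMD report states.

```python
# Figures copied from the DMD record quoted above.
blocks = 61
usable_bytes = 499_712       # total bytes attributed to this stack trace
requested_bytes = 386_252
slop_bytes = 113_460
heap_fraction = 0.0116       # "1.16% of the heap"
unreported_fraction = 0.0409  # "4.09% of unreported"

# Requested plus slop should account for the whole record.
assert requested_bytes + slop_bytes == usable_bytes

# Both totals divide evenly by the block count, which is consistent with
# 61 allocations of the same size.
per_block_usable, rem_usable = divmod(usable_bytes, blocks)
per_block_requested, rem_requested = divmod(requested_bytes, blocks)

# Back out rough totals from the quoted percentages (these are rounded in
# the report, so treat the results as estimates).
heap_estimate = usable_bytes / heap_fraction
unreported_estimate = usable_bytes / unreported_fraction

print(per_block_usable, rem_usable)        # bytes actually used per block
print(per_block_requested, rem_requested)  # bytes requested per block
print(round(heap_estimate / 1e6, 1), "MB heap (estimate)")
print(round(unreported_estimate / 1e6, 1), "MB unreported (estimate)")
```

Each block works out to 8,192 bytes in use for 6,332 bytes requested, so the record is about half a megabyte in total, out of a heap of roughly 43 MB.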
Comment 10 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-02-15 19:36:29 PST
This is awesome jgriffin, thanks!

(In reply to Jonathan Griffin (:jgriffin) from comment #7)
> Created attachment 714700 [details]
> output of 'thread apply all bt'

Unfortunately nothing looks wedged here.  In particular, all the chromium threads are sitting at epoll, and the compositor thread is one of those.
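For a long 'thread apply all bt' dump, a quick way to get the same overview is to list the innermost frame of each thread.  This is a minimal Python sketch; the "Thread N (...)" header and "#0 ..." frame layout are assumed from typical gdb output, and the sample text is invented rather than copied from the attachment.

```python
import re
from collections import Counter

THREAD_RE = re.compile(r"^Thread (\d+) ")
# Frame 0 may or may not carry a leading "0x... in " address.
FRAME0_RE = re.compile(r"^#0\s+(?:0x[0-9a-fA-F]+ in )?([^\s(]+)")


def top_frames(bt_text):
    """Map gdb thread number to the function name in its frame #0."""
    result = {}
    current = None
    for line in bt_text.splitlines():
        match = THREAD_RE.match(line)
        if match:
            current = int(match.group(1))
            continue
        match = FRAME0_RE.match(line)
        if match and current is not None:
            result[current] = match.group(1)
    return result


# Hypothetical sample input, invented for this sketch.
sample = """\
Thread 2 (Thread 109.120):
#0  0x40012345 in epoll_wait () from /system/lib/libc.so
#1  0x40d70000 in epoll_dispatch (base=0x1) at epoll.c:1

Thread 1 (Thread 109.109):
#0  __futex_syscall3 () at futex.S:39
"""

frames = top_frames(sample)
summary = Counter(frames.values())
for function, count in summary.most_common():
    print(count, function)
```

A thread that is genuinely wedged would typically show up here with a lock or wait function at the top rather than an idle event-loop call such as epoll_wait.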
Comment 11 Jonathan Griffin (:jgriffin) 2013-02-15 20:49:48 PST
(In reply to Justin Lebar [:jlebar] from comment #8)
> 
> It actually might be very interesting if you could reproduce this problem
> with bug 798491 disabled.  We'd get a very different memory report, I
> expect.  I think this is just a matter of flipping the
> "jsloader.reuseGlobal" pref to false in b2g.js.
> 

Ok, I'll do that.

(In reply to Chris Jones [:cjones] [:warhammer] from comment #9)
> OK, this guy is *very* suspicious:
> 
> This is the ipc::Transport we allocate *per process* for the graphics
> pipeline.  Having 61 of them alive when there are 2 content processes is
> very worrying ...

FWIW, the freeze usually occurs during the start-app transition, with the transition half-complete on the screen.
Comment 12 Jonathan Griffin (:jgriffin) 2013-02-19 18:43:20 PST
Created attachment 715824 [details]
about:memory report when reuseGlobal=false
Comment 13 Jonathan Griffin (:jgriffin) 2013-02-19 18:43:49 PST
Created attachment 715825 [details]
DMD reports when reuseGlobal=false
Comment 14 Jonathan Griffin (:jgriffin) 2013-02-19 18:45:15 PST
With jsloader.reuseGlobal=false, I get somewhat different symptoms.  Instead of a hard freeze followed by normal behavior, the tests just gradually get slower and slower until the phone becomes entirely non-operational.  I've attached the memory and DMD dumps from this.  I'll attach the output of 'adb shell dmesg' as well.
Comment 15 Jonathan Griffin (:jgriffin) 2013-02-19 18:45:43 PST
Created attachment 715827 [details]
output of 'adb shell dmesg'
Comment 16 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-02-21 17:30:25 PST
If these tests are exercising NITZ or the time API, there's an outside chance that bug 842550 could have contributed to hangs.
Comment 17 Justin Lebar (not reading bugmail) 2013-02-21 17:33:33 PST
jgriffin, you are a hero for all this debugging info.  Sorry I haven't had a chance to look at it.  I may not be able to get to it until next week.
Comment 18 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-02-22 13:32:35 PST
Are the UI freezes still reproducing?
Comment 19 Jonathan Griffin (:jgriffin) 2013-02-22 14:15:50 PST
We've changed the automation to restart B2G between each test, to avoid hitting this problem.  I'll run them locally without the restart and see if the freeze still occurs.
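The restart-between-tests workaround could look roughly like the following Python sketch.  This is not the actual automation code: the function names are hypothetical, and the use of 'adb shell stop b2g' and 'adb shell start b2g' to restart the process is an assumption, as is the absence of any wait for the device to finish booting.

```python
import subprocess


def restart_b2g(run=subprocess.check_call):
    """Restart the b2g process on the device (assumed commands, see above)."""
    run(["adb", "shell", "stop", "b2g"])
    run(["adb", "shell", "start", "b2g"])


def run_with_restarts(tests, restart=restart_b2g):
    """Run (name, callable) pairs, restarting B2G before each one."""
    results = {}
    for name, test in tests:
        restart()
        try:
            test()
            results[name] = "pass"
        except Exception as exc:  # record the failure and keep going
            results[name] = "fail: %s" % exc
    return results
```

Passing a different restart callable makes it easy to run the same tests without the restart, which is the comparison described here.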
Comment 20 Jonathan Griffin (:jgriffin) 2013-02-22 18:57:42 PST
I ran these tests a couple of times locally.  I'd say that the freezing problem is fixed, but other bad problems remain if we don't restart B2G between each test.  These problems include all the icons disappearing from the homescreen, IPC errors and child process crashes.
Comment 21 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-02-22 20:10:42 PST
Please to be filing! :)
