While running the Gaia UI tests, apps will randomly freeze for a few minutes, during which time the UI is (mostly) unresponsive. This seems to be less frequent today than it was earlier in the week, but it still occurs, and it completely breaks the tests when it happens. It tends to happen toward the end of the test cycle, after many apps have been exercised.
Per the discussion in bug 837187 I gathered a gdb thread dump and an about-memory report with MOZ_DMD enabled, and am attaching those (plus the logcat) here.
Note that the memory dump seemed to hang at "Processing DMD files. This may take a minute or two." I killed it after 20 minutes, but am attaching the raw files.
Created attachment 714694 [details]
Created attachment 714695 [details]
Created attachment 714696 [details]
Created attachment 714697 [details]
Created attachment 714698 [details]
Created attachment 714699 [details]
Created attachment 714700 [details]
output of 'thread apply all bt'
I don't see anything particularly wrong in the gdb output, but maybe cjones will. What concerns me the most is that the main process has an RSS here of 87 MB, which is very bad.
I'm not sure that the memory problem is causing your hang, though. I'd expect to see all of the apps on the system killed before we hang, but you still have the homescreen and a browser process alive.
It might be worth checking the output of 'adb shell dmesg' next time this happens.
On the upside, the main process does not have high heap-unclassified (and also the DMD report for the main process was fully processed), but on the downside, most of the memory falls into the one System Principal compartment, which is mostly opaque. This is due to bug 798491.
It actually might be very interesting if you could reproduce this problem with bug 798491 disabled. We'd get a very different memory report, I expect. I think this is just a matter of flipping the "jsloader.reuseGlobal" pref to false in b2g.js.
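Concretely, the pref flip suggested above would be a one-line change in the default-prefs file. A sketch, using the standard Gecko default-prefs syntax (the pref name and file come straight from the comment above; whether a line for it already exists in b2g.js is an assumption to check):

```javascript
// In b2g.js: disable the shared-global JS loader (bug 798491)
// so about:memory attributes system-compartment memory in more detail.
pref("jsloader.reuseGlobal", false);
```

After changing this, B2G needs to be rebuilt/reflashed (or the pref overridden in the profile) and restarted for the new memory reports to reflect it.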
FWIW the memory report here looks pretty different from the one in bug 837187, although there may be some overlap, because they're both using too much memory in the system compartment.
OK, this guy is *very* suspicious:
Unreported: 61 blocks in stack trace record 4 of 819
499,712 bytes (386,252 requested / 113,460 slop)
1.16% of the heap (5.96% cumulative); 4.09% of unreported (20.93% cumulative)
replace_malloc /home/jgriffin/mozilla-inbound/src/memory/replace/dmd/DMD.cpp:1228 (0x4008b75e libdmd.so+0x375e)
malloc /home/jgriffin/mozilla-inbound/src/memory/build/replace_malloc.c:152 (0x401ff2fa libmozglue.so+0x42fa)
moz_xmalloc /home/jgriffin/mozilla-inbound/src/memory/mozalloc/mozalloc.cpp:55 (0x411bebfa libxul.so+0xfa7bfa)
Channel /home/jgriffin/mozilla-inbound/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc:838 (0x40d780f4 libxul.so+0xb610f4)
mozilla::ipc::OpenDescriptor(mozilla::ipc::TransportDescriptor const&, IPC::Channel::Mode) /home/jgriffin/mozilla-inbound/src/ipc/glue/Transport_posix.cpp:56 (0x40b8eadc libxul.so+0x977adc)
mozilla::dom::PContentParent::OnMessageReceived(IPC::Message const&) /home/jgriffin/unagi/objdir-gecko/ipc/ipdl/PContentParent.cpp:2371 (0x410be68a libxul.so+0x9a168a)
This is the ipc::Transport we allocate *per process* for the graphics pipeline. Having 61 of them alive when there are 2 content processes is very worrying ...
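One quick way to triage a large DMD dump for records like the one above is to scan the "Unreported" record headers and flag any with a suspiciously high live-block count. A minimal sketch in Python; the regex matches the header format shown in this report, and the names (`HEADER`, `suspicious_records`, the threshold of 50) are illustrative, not part of any DMD tooling:

```python
import re

# Matches DMD record headers of the form:
#   "Unreported: 61 blocks in stack trace record 4 of 819"
HEADER = re.compile(
    r"Unreported: (\d+) blocks in stack trace record (\d+) of (\d+)"
)

def suspicious_records(dmd_text, min_blocks=50):
    """Return (record_number, block_count) pairs at or above min_blocks."""
    hits = []
    for m in HEADER.finditer(dmd_text):
        blocks, record, _total = map(int, m.groups())
        if blocks >= min_blocks:
            hits.append((record, blocks))
    return hits

sample = "Unreported: 61 blocks in stack trace record 4 of 819"
print(suspicious_records(sample))  # [(4, 61)]
```

Running this over the raw DMD files attached here would surface the Transport record (61 blocks) and any other similarly inflated allocation sites without waiting for the full processing step that hung earlier.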
This is awesome jgriffin, thanks!
(In reply to Jonathan Griffin (:jgriffin) from comment #7)
> Created attachment 714700 [details]
> output of 'thread apply all bt'
Unfortunately nothing looks wedged here. In particular, all the chromium threads are sitting at epoll, and the compositor thread is one of those.
(In reply to Justin Lebar [:jlebar] from comment #8)
> It actually might be very interesting if you could reproduce this problem
> with bug 798491 disabled. We'd get a very different memory report, I
> expect. I think this is just a matter of flipping the
> "jsloader.reuseGlobal" pref to false in b2g.js.
OK, I'll do that.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #9)
> OK, this guy is *very* suspicious:
> This is the ipc::Transport we allocate *per process* for the graphics
> pipeline. Having 61 of them alive when there are 2 content processes is
> very worrying ...
FWIW, the freeze usually occurs during the start-app transition, with the transition half-complete on the screen.
Created attachment 715824 [details]
about:memory report when reuseGlobal=false
Created attachment 715825 [details]
DMD reports when reuseGlobal=false
With jsloader.reuseGlobal=false, I get somewhat different symptoms. Instead of a hard freeze followed by normal behavior, the tests just gradually get slower and slower until the phone becomes entirely non-operational. I've attached the memory and DMD dumps from this. I'll attach the output of 'adb shell dmesg' as well.
Created attachment 715827 [details]
output of 'adb shell dmesg'
If these tests are exercising NITZ or the time API, there's an outside chance that bug 842550 could have contributed to hangs.
jgriffin, you are a hero for all this debugging info. Sorry I haven't had a chance to look at it. I may not be able to get to it until next week.
Are the UI freezes still reproducing?
We've changed the automation to restart B2G between each test, to avoid hitting this problem. I'll run them locally without the restart and see if the freeze still occurs.
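For reference, restarting B2G between tests can be done from the host with adb. A sketch of the device commands (assumes an engineering build where the b2g service is controllable via the init stop/start mechanism):

```shell
# Restart the B2G process stack on the device between tests.
adb shell stop b2g
adb shell start b2g
```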
I ran these tests a couple of times locally. I'd say the freezing problem is fixed, but other bad problems remain if we don't restart B2G between each test. These include all the icons disappearing from the homescreen, IPC errors, and child process crashes.
Please to be filing! :)