Closed Bug 851626 Opened 12 years ago Closed 11 years ago

[B2G][Camera][Gallery] Crash when switching repeatedly between Gallery and Camera apps

Categories

(Firefox OS Graveyard :: Gaia::Camera, defect, P2)

ARM
Gonk (Firefox OS)
defect

Tracking

(firefox26 affected, firefox27 affected, b2g18 unaffected)

RESOLVED WONTFIX
1.2 C3(Oct25)
Tracking Status
firefox26 --- affected
firefox27 --- affected
b2g18 --- unaffected

People

(Reporter: jcouassi, Unassigned)

References

Details

(Whiteboard: [mozilla-triage][MemShrink:P2] [TD-59414])

Attachments

(14 files)

282.37 KB, text/plain
Details
5.10 KB, text/plain
Details
15.46 KB, text/x-log
Details
1.65 MB, application/x-zip-compressed
Details
1.77 MB, application/x-zip-compressed
Details
16.78 KB, application/x-zip-compressed
Details
719 bytes, patch
Details | Diff | Splinter Review
9.07 MB, application/octet-stream
Details
1.32 KB, text/x-python
Details
2.31 MB, application/x-zip-compressed
Details
3.66 MB, application/x-zip-compressed
Details
1.75 MB, application/x-zip-compressed
Details
1.38 MB, application/x-xz
Details
2.75 KB, text/plain
Details
Description:
You can switch from Camera to Gallery over and over until the Camera crashes.

Repro Steps:
1) Update to Unagi Build ID: 20130314114915
2) Launch the Gallery application from the homescreen
3) Tap on the Camera icon (from inside Gallery)
4) Tap on the Gallery icon (from inside Camera)
5) Repeat steps 3 and 4, four or five times
6) Observe what happens

Expected:
Device switches back and forth to the proper application.

Actual:
After switching back and forth at a fast pace you will get a notification saying that Camera has crashed.

Repro frequency: 5/5, 100%

Environmental Variables:
Kernel Date: Dec 5
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18/rev/8e9dd87b4f3b
Gaia: 69dbcd84085f10bec0c0189b926ffb535b14dcfe

Notes: Checked on the master build as well; the issue repros. Log attached to the bug.
Crash report: https://crash-stats.mozilla.com/report/index/fb45a703-39cd-48b1-8c6c-fa5532130315

Signature: android::GonkCameraHardware::PullParameters
UUID: fb45a703-39cd-48b1-8c6c-fa5532130315
Date Processed: 2013-03-15 18:39:05
Process Type: content
Uptime: 6
Install Age: 20.4 hours since version was first installed.
Install Time: 2013-03-14 22:02:27
Product: B2G
Version: 18.0
Build ID: 20130314114915
Release Channel: nightly
OS: Android
OS Version: 0.0.0 Linux 3.0.8-perf #1 PREEMPT Wed Dec 5 04:47:49 PST 2012 armv7l toro/full_unagi/unagi:4.0.4.0.4.0.4/OPENMASTER/eng.cltbld.20130306.101604:user/test-keys
Build Architecture: arm
Crash Reason: SIGSEGV
Crash Address: 0x18
App Notes: EGL? EGL+ GL Context? GL Context+ GL Layers? GL Layers+
Processor Notes: sp-processor02.phx1.mozilla.com_2251:2008; this crash has been processed more than once; WARNING: JSON file missing Add-ons; exploitablity tool: ERROR: unable to analyze dump
EMCheckCompatibility: False
Device: toro unagi1
Android API Version: 15 (AOSP)
Android CPU ABI: armeabi-v7a

Related Bugs: bug 850845 (NEW) - Camera - crash when trying to open a second camera instance

Crashing Thread:
Frame  Module       Signature                                          Source
0      libxul.so    android::GonkCameraHardware::PullParameters        GonkCameraHwMgr.cpp:281
1      libxul.so    mozilla::nsGonkCameraControl::PullParametersImpl   GonkCameraControl.cpp:855
2      libxul.so    mozilla::nsGonkCameraControl::Init                 GonkCameraControl.cpp:233
3      libxul.so    InitGonkCameraControl::Run                         GonkCameraControl.cpp:179
4      libxul.so    nsThread::ProcessNextEvent                         nsThread.cpp:620
5      libxul.so    NS_ProcessNextEvent_P                              nsThreadUtils.cpp:237
6      libxul.so    nsThread::ThreadFunc                               nsThread.cpp:258
7      libnspr4.so  _pt_root                                           ptthread.c:191
8      libc.so      __thread_entry                                     pthread.c:217
9      libc.so      pthread_create                                     pthread.c:357
This looks a lot like the crash in bug 850845, for which there is a patch pending (and that will land as soon as m-i is reopened). I'm a little concerned that we can reach this point, though; even with the aforementioned fix, the camera will fail to start even if it doesn't crash.
Unable to reproduce on unagi with:
- gecko: inbound-src:c45d34db0d69
- gaia: c4d153b9f2f079400ce0eac73ea04137098230a0

Repeated STR steps 3 and 4 20+ times. Will try on b2g18 branch.
Unable to reproduce on unagi with:
- gecko: b2g18:a827df06cffb
- gaia: c4d153b9f2f079400ce0eac73ea04137098230a0

I'm running a DEBUG build, however, which may slow things down enough to hide a race condition. Will try with a non-DEBUG build.
Unable to reproduce with non-DEBUG build.
Issue repros on:

Unagi Build ID: 20130404070202
Kernel Date: Dec 5
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18/rev/da523063aa7b
Gaia: a845be046c5d3cb077e3c78f963ca5c079e7ab3d

Once you switch back and forth between Gallery and Camera around 7 or 8 times, the camera/gallery crashes and it takes you back to the homescreen.
(In reply to Jeni from comment #6)
> Issue repros on
>
> Unagi Build ID: 20130404070202

I am unable to reproduce this issue using this specific build, even after 60+ switches (30+ complete cycles) between the camera app and the gallery.

Jeni, can you try repeating this test with a fresh, empty memory card? (Please _don't_ erase the memory card you're currently using--if this does turn out to be a crash due to a specific image, we'll need to examine your images to determine the cause.)
(In reply to Mike Habicher [:mikeh] from comment #7)
>
> I am unable to reproduce this issue using this specific build, even after
> 60+ switches (30+ complete cycles) between the camera app and the gallery.
>
> Jeni, can you try repeating this test with a fresh, empty memory card?
> (Please _don't_ erase the memory card you're currently using--if this does
> turn out to be a crash due to a specific image, we'll need to examine your
> images to determine the cause.)

Unable to repro the issue with only 37 pictures (39.3 MB used). With the SD card from the device that has 2.3 GB used for pictures, the issue does repro.
Thanks, Jeni! Sounds like an out-of-memory issue. djf?
Flags: needinfo?(dflanagan)
I've more-or-less given up on OOMs with gallery. People keep testing it by putting big honking 5 megapixel images that don't have good EXIF previews on their SD cards. Gecko can't handle it. See bug 854783.

In general, Gallery can do a good job with photos from the Camera app. But given the current limitations of gecko, it cannot handle large images gracefully. If we get a fix for bug 854799, that will go a long way to fixing the problem.

Jeni, is the gallery app scanning photos when this crash occurs (crawling ants animations at the top of the screen)? If so, and if the photos are not photos from the camera app, then it is probably eating up lots and lots of memory, which means that other apps get killed to free up more memory. (And then, if we're unlucky, the gallery app gets killed too).

And if this is the case, then this probably has nothing to do with switching back and forth between apps. If gallery is trying to scan a bunch of big images, other apps are going to be killed to make room. This is basically normal.

So, if scanning is happening when this occurs, then I recommend closing this bug as a dupe of bug 854783.
Flags: needinfo?(dflanagan) → needinfo?(jcouassi)
On the other hand, I don't usually see a "app has crashed" notification with OOMs, so if there is actually a notification and a real crash report, then maybe something else is going on. Mike, can you tell if there is a real crash here? Could the app being killed because of memory pressure be causing a crash somehow?
Flags: needinfo?(mhabicher)
djf, to determine if an app is killed due to OoM, you need to look in the kernel logs: 'adb shell dmesg'. I don't remember the exact strings, but they're pretty obvious.
Flags: needinfo?(mhabicher)
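For what it's worth, the check described above can be scripted. The following is a rough Python sketch only; the "select ... to kill" / "send sigkill to ..." strings match the low-memory-killer messages that appear later in this bug's kernel logs, but they can differ between kernels:

import re
import subprocess

# Pull the kernel log over adb and keep only low-memory-killer kill messages.
KILL_RE = re.compile(r'(select|send sigkill to) (\d+) \(([^)]+)\), adj (\d+), size (\d+)')

dmesg = subprocess.run(["adb", "shell", "dmesg"],
                       capture_output=True, text=True, check=True).stdout

for line in dmesg.splitlines():
    m = KILL_RE.search(line)
    if m and m.group(1) == "send sigkill to":
        pid, name, adj, size = m.group(2), m.group(3), m.group(4), m.group(5)
        # "size" is in pages (4 KB each) on these kernels.
        print("LMK killed %s (pid %s), oom_adj %s, size %s pages" % (name, pid, adj, size))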
(In reply to David Flanagan [:djf] from comment #10)
> Jeni, is the gallery app scanning photos when this crash occurs (crawling
> ants animations at the top of the screen)? If so, and if the photos are not
> photos from the camera app, then it is probably eating up lots and lots of
> memory, which means that other apps get killed to free up more memory. (And
> then, if we're unlucky, the gallery app gets killed too). And if this is
> the case, then this probably has nothing to do with switching back and forth
> between apps. If gallery is trying to scan a bunch of big images, other apps
> are going to be killed to make room. This is basically normal.

I am seeing the animation at the top of the screen with the main device I have been working with. I have tested with applications running in the background and without, and the issue occurs in both cases. I also removed the pictures I had added and took a bunch of pictures (383.7 MB), and still had the same issue with it crashing.
Flags: needinfo?(jcouassi)
I wrote an automated test for this today and will run it overnight, to try to reproduce the camera crash.
I ran the automated test several times on my own engineering builds on unagi (over 1000 iterations total) and it passed (no crashes reproduced). Just had one photo on the usd card and no extra apps running in the background.
Issue repros in:

Inari Build ID: 20130503070205
Kernel Date: Feb 21
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18_v1_0_1/rev/3f3489356bbc
Gaia: 3e232bce289c9e156d92553e752616cba284bc8f

And in:

Unagi Build ID: 20130503070204
Kernel Date: Dec 5
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18/rev/8becaf2a0bc7
Gaia: b0aca0dd1e2955e11190ede725e1fb9ee596438b

Once you switch back and forth between Gallery and Camera around 7 or 8 times, the camera/gallery crashes and either takes you back to the homescreen or freezes on Gallery with no buttons shown and only one or two pictures visible.
Note: This crash is reproducible via the gaia-ui gallery_camera endurance test on Inari with b2g 18 v1.0.1.
rwood, when this crash happens, do you see any OoM errors in the kernel logs? 'adb shell dmesg'.
blocking-b2g: --- → leo?
Recommend not blocking because this is a stress test and not a normal user scenario.
Whiteboard: [mozilla-triage]
Leo thinks it's a normal user scenario and blocks.
blocking-b2g: leo? → leo+
Summary: [B2G][Camera][Gallery]Camera crashes when switch repeatedly between Gallery and Camera mode → [B2G][Camera][Gallery] Crash when switching repeatedly between Gallery and Camera apps
Trying to reproduce this in Inari manually (as per comment 16) with:
- gecko: b2g18:78de618c071a
- gaia: v1.0.1:42b5b9f6d6c045039e1bd88cd32d5f850e3d3750

Unable to do so. 'adb shell b2g-ps' shows fluctuations in VSIZE and RSS, but nothing indicating an obvious resource leak.
Assignee: nobody → mhabicher
Initial: 20130612160407

Gaia Endurance Test: gallery_camera

20130612160407 Checkpoint after iteration 10 of 100:
APPLICATION      USER      PID   PPID  VSIZE   RSS    WCHAN     PC       NAME
b2g              root      1524  1     193640  76696  ffffffff  400f06ec R /system/b2g/b2g
Homescreen       app_1610  1610  1524  73320   27832  ffffffff  400f3330 S /system/b2g/plugin-container
Usage            app_1622  1622  1524  65516   24900  ffffffff  400e3330 S /system/b2g/plugin-container
Gallery          app_1636  1636  1524  74588   28144  ffffffff  40038330 S /system/b2g/plugin-container
Camera           app_1655  1655  1524  72384   24872  ffffffff  4001c330 S /system/b2g/plugin-container

Final: 20130612170959

Checkpoint after iteration 100 of 100:
APPLICATION      USER      PID   PPID  VSIZE   RSS     WCHAN     PC       NAME
b2g              root      1524  1     228024  113376  ffffffff  4100cdce R /system/b2g/b2g
Homescreen       app_1610  1610  1524  73320   17224   ffffffff  400f3330 S /system/b2g/plugin-container
Gallery          app_1636  1636  1524  89980   25892   ffffffff  40038330 S /system/b2g/plugin-container

This test loads the SD card with a single image of the Firefox logo; the image file is a 355 KiB JPG.
Wow, it gets better--it seems that with a DEBUG build running on Inari with:
- gecko: b2g18:7609f4d7e9b0
- gaia: v1.0.1:ed3b9e7ed0e083cd1c587c160d6a63440b29fad8

...very early on in the iterations, switching to the Gallery app causes the camera to die; and vice-versa. Is there something wrong with the LMK? There also appear to be, at times, two preallocated processes.

Kernel messages:

# adb shell dmesg | grep kswapd0
<4>[06-12 22:12:09.080] [25: kswapd0]select 999 (Camera), adj 6, size 7082, to kill
<4>[06-12 22:12:09.080] [25: kswapd0]send sigkill to 999 (Camera), adj 6, size 7082
<4>[06-12 22:14:56.494] [25: kswapd0]select 1139 (Camera), adj 6, size 7232, to kill
<4>[06-12 22:14:56.494] [25: kswapd0]send sigkill to 1139 (Camera), adj 6, size 7232
<4>[06-12 22:16:22.528] [25: kswapd0]select 1209 (Camera), adj 6, size 6661, to kill
<4>[06-12 22:16:22.528] [25: kswapd0]send sigkill to 1209 (Camera), adj 6, size 6661
<4>[06-12 22:17:49.873] [25: kswapd0]select 1281 (Camera), adj 6, size 6727, to kill
<4>[06-12 22:17:49.873] [25: kswapd0]send sigkill to 1281 (Camera), adj 6, size 6727
<4>[06-12 22:18:45.797] [25: kswapd0]select 1316 (Gallery), adj 6, size 5413, to kill
<4>[06-12 22:18:45.797] [25: kswapd0]send sigkill to 1316 (Gallery), adj 6, size 5413
<4>[06-12 22:20:44.814] [25: kswapd0]select 1420 (Camera), adj 6, size 7544, to kill
<4>[06-12 22:20:44.814] [25: kswapd0]send sigkill to 1420 (Camera), adj 6, size 7544
<4>[06-12 22:21:37.495] [25: kswapd0]select 1456 (Gallery), adj 6, size 7992, to kill
<4>[06-12 22:21:37.495] [25: kswapd0]send sigkill to 1456 (Gallery), adj 6, size 7992
<4>[06-12 22:23:07.092] [25: kswapd0]select 1695 (Gallery), adj 6, size 8276, to kill
<4>[06-12 22:23:07.092] [25: kswapd0]send sigkill to 1695 (Gallery), adj 6, size 8276
<4>[06-12 22:23:37.832] [25: kswapd0]select 1874 (Camera), adj 6, size 7950, to kill
<4>[06-12 22:23:37.832] [25: kswapd0]send sigkill to 1874 (Camera), adj 6, size 7950
<4>[06-12 22:24:33.216] [25: kswapd0]select 1910 (Gallery), adj 6, size 7891, to kill
<4>[06-12 22:24:33.216] [25: kswapd0]send sigkill to 1910 (Gallery), adj 6, size 7891
<4>[06-12 22:25:06.499] [25: kswapd0]select 1943 (Camera), adj 6, size 7515, to kill
<4>[06-12 22:25:06.499] [25: kswapd0]send sigkill to 1943 (Camera), adj 6, size 7515
<4>[06-12 22:26:01.713] [25: kswapd0]select 1980 (Gallery), adj 6, size 7085, to kill
<4>[06-12 22:26:01.713] [25: kswapd0]send sigkill to 1980 (Gallery), adj 6, size 7085

jlebar: have we regressed on memory recently?
Flags: needinfo?(justin.lebar+bug)
> jlebar: have we regressed on memory recently?

Not to my knowledge.

What's happening in comment 22 is that the main process grows in size from 77mb to 113mb. Now we have less space for apps.

We need a get_about_memory.py dump after the main process is using 110+mb of RAM.
Flags: needinfo?(justin.lebar+bug)
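One way to catch that state automatically is to poll 'adb shell b2g-ps' and kick off get_about_memory.py once the main process's RSS crosses the threshold. A rough Python sketch; the RSS column index and the get_about_memory.py path are assumptions based on the b2g-ps output and the B2G tools referenced in this bug:

import subprocess
import time

THRESHOLD_KB = 110 * 1024  # ~110 MB RSS for the main b2g process

def b2g_rss_kb():
    out = subprocess.run(["adb", "shell", "b2g-ps"],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        cols = line.split()
        # Rows look like: "b2g root 1524 1 193640 76696 ..." (VSIZE then RSS, in KB).
        if cols and cols[0] == "b2g":
            return int(cols[5])
    return 0

while b2g_rss_kb() < THRESHOLD_KB:
    time.sleep(30)

# Grab the memory report once the threshold is crossed.
subprocess.run(["./tools/get_about_memory.py"], check=True)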
This is the b2g-ps output from a freshly-rebooted device (ignore comment 22, it's from a previous run); it goes with the following grepped (truncated, it seems) kernel log entries:

<4>[06-12 22:28:05.123] [25: kswapd0]select 2104 (Camera), adj 6, size 6781, to kill
<4>[06-12 22:28:05.123] [25: kswapd0]send sigkill to 2104 (Camera), adj 6, size 6781
<4>[06-12 22:30:38.673] [25: kswapd0]select 2211 (Gallery), adj 6, size 6446, to kill
<4>[06-12 22:30:38.673] [25: kswapd0]send sigkill to 2211 (Gallery), adj 6, size 6446
<4>[06-12 22:31:12.916] [25: kswapd0]select 2246 (Camera), adj 6, size 6787, to kill
<4>[06-12 22:31:12.916] [25: kswapd0]send sigkill to 2246 (Camera), adj 6, size 6787
<4>[06-12 22:32:11.163] [25: kswapd0]select 2282 (Gallery), adj 6, size 7686, to kill
<4>[06-12 22:32:11.163] [25: kswapd0]send sigkill to 2282 (Gallery), adj 6, size 7686
<4>[06-12 22:32:44.906] [25: kswapd0]select 2316 (Camera), adj 6, size 7281, to kill
<4>[06-12 22:32:44.906] [25: kswapd0]send sigkill to 2316 (Camera), adj 6, size 7281
<4>[06-12 22:33:41.661] [25: kswapd0]select 2352 (Gallery), adj 6, size 7249, to kill
<4>[06-12 22:33:41.661] [25: kswapd0]send sigkill to 2352 (Gallery), adj 6, size 7249
<4>[06-12 22:35:51.598] [25: kswapd0]select 2454 (Camera), adj 6, size 6886, to kill
<4>[06-12 22:35:51.598] [25: kswapd0]send sigkill to 2454 (Camera), adj 6, size 6886
<4>[06-12 22:36:59.354] [25: kswapd0]select 2490 (Gallery), adj 6, size 5106, to kill
<4>[06-12 22:36:59.354] [25: kswapd0]send sigkill to 2490 (Gallery), adj 6, size 5106
<4>[06-12 22:37:32.256] [25: kswapd0]select 2524 (Camera), adj 6, size 6209, to kill
<4>[06-12 22:37:32.256] [25: kswapd0]send sigkill to 2524 (Camera), adj 6, size 6209
<4>[06-12 22:38:34.257] [25: kswapd0]select 2560 (Gallery), adj 6, size 7304, to kill
<4>[06-12 22:38:34.257] [25: kswapd0]send sigkill to 2560 (Gallery), adj 6, size 7304
<4>[06-12 22:39:11.353] [25: kswapd0]select 2595 (Camera), adj 6, size 7020, to kill
<4>[06-12 22:39:11.353] [25: kswapd0]send sigkill to 2595 (Camera), adj 6, size 7020
<4>[06-12 22:40:14.094] [25: kswapd0]select 2631 (Gallery), adj 6, size 6957, to kill
<4>[06-12 22:40:14.094] [25: kswapd0]send sigkill to 2631 (Gallery), adj 6, size 6957
<4>[06-12 22:40:51.671] [25: kswapd0]select 2664 (Camera), adj 6, size 6672, to kill
<4>[06-12 22:40:51.671] [25: kswapd0]send sigkill to 2664 (Camera), adj 6, size 6672
<4>[06-12 22:41:57.185] [25: kswapd0]select 2699 (Gallery), adj 6, size 5624, to kill
<4>[06-12 22:41:57.185] [25: kswapd0]send sigkill to 2699 (Gallery), adj 6, size 5624
<4>[06-12 22:43:43.338] [25: kswapd0]select 2773 (Gallery), adj 6, size 5509, to kill
<4>[06-12 22:43:43.338] [25: kswapd0]send sigkill to 2773 (Gallery), adj 6, size 5509
<4>[06-12 22:44:30.795] [25: kswapd0]select 2807 (Camera), adj 6, size 5510, to kill
<4>[06-12 22:44:30.795] [25: kswapd0]send sigkill to 2807 (Camera), adj 6, size 5510
<4>[06-12 22:46:11.973] [25: kswapd0]select 2879 (Camera), adj 6, size 6700, to kill
<4>[06-12 22:46:11.973] [25: kswapd0]send sigkill to 2879 (Camera), adj 6, size 6700
<4>[06-12 22:47:54.423] [25: kswapd0]select 2951 (Camera), adj 6, size 6515, to kill
<4>[06-12 22:47:54.423] [25: kswapd0]send sigkill to 2951 (Camera), adj 6, size 6515

I forgot to build with DMD enabled, but will get that data next.

In the attachment, you can see that even initially, with b2g.VSIZE=195908 and .RSS=93516, after a cycle of:

a. open/switch to Camera app
b. switch to Gallery app
c. grab b2g-ps

...the Camera is killed.
Attachment #761754 - Flags: feedback?(justin.lebar+bug)
fwiw you don't need to build with DMD enabled to do get_about_memory.py. We only need DMD if the result of get_about_memory.py shows high heap-unclassified.
The 110+mb main process memory usage shouldn't be happening. But aside from that, everything looks like it's working properly. At cjones's insistence, the preallocated process runs with the same priority as other bg processes. And the homescreen app runs with higher priority. So after those two apps, we don't have a lot of space left.

It would be a bit interesting to see what the output of b2g-info is, because it will show you how much memory is actually free on the system. b2g-info isn't merged into mainline yet, but you can get it with something like

$ cd root/b2g/checkout
$ git remote add https://github.com/jlebar/B2G jlebar
$ git fetch jlebar
$ git checkout b2g-info

At this point you can do either:

$ ./build.sh && ./flash.sh

or

$ ./build.sh b2g-info
$ adb remount
$ adb push out/target/product/<XXX>/system/bin/b2g-info /system/bin
DMD report for the b2g parent process shows:

------------------------------------------------------------------
Unreported stack trace records
------------------------------------------------------------------

Unreported: 13 blocks in stack trace record 1 of 713
7,987,200 bytes (7,987,200 requested / 0 slop)
14.16% of the heap (14.16% cumulative); 31.49% of unreported (31.49% cumulative)
Allocated at
  malloc /home/mikeh/dev/mozilla/m-c/b2g18/memory/build/replace_malloc.c:152 (0x400f142c libmozglue.so+0x442c)
  yyalloc /home/mikeh/dev/mozilla/btg024/objdir-gecko-b2g18-debug-dmd/gfx/angle/glslang_lex.cpp:2930 (0x413bd540 libxul.so+0x1273540)
  gfxImageSurface /home/mikeh/dev/mozilla/m-c/b2g18/gfx/thebes/gfxImageSurface.cpp:111 (0x411f630a libxul.so+0x10ac30a)
  nsRefPtr<gfxASurface>::assign_with_AddRef(gfxASurface*) /home/mikeh/dev/mozilla/btg024/objdir-gecko-b2g18-debug-dmd/dist/include/nsAutoPtr.h:844 (0x41216310 libxul.so+0x10cc310)
  nsRefPtr<gfxASurface> /home/mikeh/dev/mozilla/btg024/objdir-gecko-b2g18-debug-dmd/dist/include/nsAutoPtr.h:903 (0x41209920 libxul.so+0x10bf920)
  nsRefPtr<gfxASurface>::assign_assuming_AddRef(gfxASurface*) /home/mikeh/dev/mozilla/btg024/objdir-gecko-b2g18-debug-dmd/dist/include/nsAutoPtr.h:859 (0x40546c60 libxul.so+0x3fcc60)
  mozilla::image::RasterImage::DecodingComplete() /home/mikeh/dev/mozilla/btg024/objdir-gecko-b2g18-debug-dmd/dist/include/nsError.h:1065 (0x4054073a libxul.so+0x3f673a)
  mozilla::image::Decoder::PostDecodeDone() /home/mikeh/dev/mozilla/btg024/objdir-gecko-b2g18-debug-dmd/dist/include/nsCOMPtr.h:762 (0x4053aabc libxul.so+0x3f0abc)
  mozilla::image::nsJPEGDecoder::NotifyDone() /home/mikeh/dev/mozilla/m-c/b2g18/image/decoders/nsJPEGDecoder.cpp:533 (0x40cb3680 libxul.so+0x40d680)
  mozilla::image::term_source(jpeg_decompress_struct*) /home/mikeh/dev/mozilla/m-c/b2g18/image/decoders/nsJPEGDecoder.cpp:851 (0x40cb36b4 libxul.so+0x40d6b4)
  jpeg_finish_decompress /home/mikeh/dev/mozilla/m-c/b2g18/media/libjpeg/jdapimin.c:393 (0x41af735e libxul.so+0x125135e)
  mozilla::image::nsJPEGDecoder::WriteInternal(char const*, unsigned int) /home/mikeh/dev/mozilla/m-c/b2g18/image/decoders/nsJPEGDecoder.cpp:502 (0x40cb42d4 libxul.so+0x40e2d4)
  mozilla::image::Decoder::Write(char const*, unsigned int) /home/mikeh/dev/mozilla/m-c/b2g18/image/src/Decoder.cpp:81 (0x4053a978 libxul.so+0x3f0978)
  mozilla::image::RasterImage::WriteToDecoder(char const*, unsigned int) /home/mikeh/dev/mozilla/m-c/b2g18/image/src/RasterImage.cpp:2501 (0x4053ff28 libxul.so+0x3f5f28)
  mozilla::image::RasterImage::DecodeSomeData(unsigned int) /home/mikeh/dev/mozilla/m-c/b2g18/image/src/RasterImage.cpp:3098 (0x40540006 libxul.so+0x3f6006)
  mozilla::image::RasterImage::DecodeWorker::DecodeSomeOfImage(mozilla::image::RasterImage*, mozilla::image::RasterImage::DecodeWorker::DecodeType) /home/mikeh/dev/mozilla/btg024/objdir-gecko-b2g18-debug-dmd/dist/include/nsError.h:1065 (0x40540b06 libxul.so+0x3f6b06)
  mozilla::image::RasterImage::DecodeWorker::Run() /home/mikeh/dev/mozilla/m-c/b2g18/image/src/RasterImage.cpp:3335 (0x40c9cd16 libxul.so+0x3f6d16)
(In reply to Justin Lebar [:jlebar] from comment #27)
>
> $ cd root/b2g/checkout
> $ git remote add https://github.com/jlebar/B2G jlebar

I can't get past this step:

22:10:14 ➜ btg024 git:(master) ✗ git remote add https://github.com/jlebar/B2G jlebar
fatal: 'https://github.com/jlebar/B2G' is not a valid remote name
22:10:29 ➜ btg024 git:(master) ✗ git remote add https://github.com/jlebar/B2G.git jlebar
fatal: 'https://github.com/jlebar/B2G.git' is not a valid remote name

(I tried the second in case the .git was missing, but it didn't make a difference.)
Okay, that worked--next:

# git checkout b2g-info
error: pathspec 'b2g-info' did not match any file(s) known to git.
I don't know why, but some versions of git make you do "checkout jlebar/b2g-info", while others work with just "b2g-info". Sorry; I didn't mean for this to be complex!
No luck with that either:

# git checkout jlebar/b2g-info
error: pathspec 'jlebar/b2g-info' did not match any file(s) known to git.
# adb shell b2g-info
                 |  megabytes  |
NAME             PID   NICE  USS    PSS    RSS    VSIZE  OOM_ADJ  USER
b2g              444   0     116.5  118.6  121.3  230.9  0        root
(Preallocated a  587   1     10.5   11.1   12.4   68.5   2        app_587
Homescreen       8272  18    14.3   16.4   19.1   73.7   4        app_8272

System memory info:
  Total            176.6 MB
  Used - cache     157.1 MB
  B2G procs (PSS)  146.0 MB
  Non-B2G procs     11.1 MB
  Free + cache      19.5 MB
  Free               7.5 MB
  Cache             12.1 MB

Low-memory killer parameters:
  notify_trigger  10240 KB

  oom_adj  min_free
  6        20480 KB
  4         8192 KB
  3         7168 KB
  2         6144 KB
  1         5120 KB
  0         4096 KB
Flags: needinfo?(justin.lebar+bug)
Second run:

# adb shell b2g-info
                 |  megabytes  |
NAME             PID   NICE  USS    PSS    RSS    VSIZE  OOM_ADJ  USER
b2g              444   0     115.8  117.9  120.7  231.9  0        root
(Preallocated a  587   1     10.5   11.1   12.4   68.5   2        app_587
Homescreen       8272  18    14.3   16.4   19.1   73.7   4        app_8272

System memory info:
  Total            176.6 MB
  Used - cache     156.5 MB
  B2G procs (PSS)  145.3 MB
  Non-B2G procs     11.2 MB
  Free + cache      20.1 MB
  Free               8.0 MB
  Cache             12.1 MB

Low-memory killer parameters:
  notify_trigger  10240 KB

  oom_adj  min_free
  6        20480 KB
  4         8192 KB
  3         7168 KB
  2         6144 KB
  1         5120 KB
  0         4096 KB
Hm, it's weird that the preallocated process has oom_adj 2. Maybe it's in the process of turning into some other process.

You can see here that the system only has 20mb free, including the buffer cache. That's not a lot of space.

So I see two bugs here:

1) The main process is using 115+mb of RAM. That's very bad.
2) The preallocated app process has oom_adj 2. It should be 6, I think. This could be bad, or it might not be a big deal.

To reproduce this, all I need to do is reboot the phone and then follow the steps in comment 0?
Flags: needinfo?(justin.lebar+bug)
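Reading the numbers above against the LMK table explains the kills: Camera and Gallery run at oom_adj 6 on this device (see the kswapd0 messages earlier), the adj-6 min_free threshold is 20480 KB, and free+cache is hovering around 19.5-20 MB, so the kernel may reclaim them. A trivial check of that arithmetic, with the values copied from the b2g-info output above:

# Values copied from the first b2g-info run above (KB).
free_plus_cache_kb = 19.5 * 1024  # "Free + cache 19.5 MB"
min_free_adj6_kb = 20480          # LMK min_free threshold for oom_adj 6

# When free+cache falls below the threshold for a given adj level, the LMK may
# kill processes at that adj level or higher -- which matches the kswapd0
# "send sigkill ... adj 6" messages for Camera/Gallery.
print(free_plus_cache_kb < min_free_adj6_kb)  # True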
Whiteboard: [mozilla-triage] → [mozilla-triage][MemShrink]
(In reply to Justin Lebar [:jlebar] from comment #36)
>
> To reproduce this, all I need to do is reboot the phone and then follow
> the steps in comment 0?

I do it with the endurance gaiatest. rwood can help you get this going, or if he's too busy, I can probably muddle you through the setup process. :)
(You can do it manually, but in my case, the device got into the wedged state logged above after 63 iterations.)
Attachment #761754 - Attachment is obsolete: true
Attachment #761754 - Flags: feedback?(justin.lebar+bug)
Attachment #761754 - Attachment is obsolete: false
Attachment #761698 - Attachment mime type: text/x-log → text/plain
Comment on attachment 761836 [details]
get_about_memory.py output, including DMD report

One thing that sticks out at me in this attachment is

2.56 MB (02.40%) ── huge/string(length=9114, "data:image//png;base64,iVBORw0KG...") [131]

That means that we have 131 copies of a length-9114 data URI. Someone (maybe Gaia, maybe Gecko) is probably leaking that.

A 9114-char data URI is not very big, so it's likely not a full image that we're leaking. So one way to approach this bug is to try to figure out what that string is.
(In reply to Justin Lebar [:jlebar] from comment #39)
>
> A 9114-char data URI is not very big, so it's likely not a full image that
> we're leaking. So one way to approach this bug is to try to figure out what
> that string is.

Looks like the first 8 characters decode to <137>PNG<13><10>. Not so useful. If we logged a bit more, perhaps we could compare it against images included in the build.
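For reference, that prefix can be checked directly against the standard 8-byte PNG signature by base64-decoding the start of the payload. A quick Python check, using the truncated payload from the report above (padded so it decodes cleanly):

import base64

# First characters of the leaked data URI's payload, as shown in the
# about:memory report ("data:image/png;base64,iVBORw0KG...").
prefix = "iVBORw0KGgo="

decoded = base64.b64decode(prefix)
print(decoded)  # b'\x89PNG\r\n\x1a\n' -- the standard PNG file signature

# 0x89 == 137, 0x0d == 13, 0x0a == 10, matching "<137>PNG<13><10>" above.
assert decoded.startswith(b"\x89PNG\r\n")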
I've been trying to debug this locally, but the camera app segfaults with a null-pointer exception on trunk. So it's somewhat slow-going...
> If we logged a bit more, perhaps we could compare it against images included in the build.

Indeed, we can and should log the whole thing. Bug 801780 is where we added our current long-string logging to about:memory. Bug 852010 is open for dumping the entire contents of long strings.

I'm juggling a lot of things at the moment; if you're interested in helping with bug 852010, that would probably help us move forward here. At the very least, we'd be able to see what image is being leaked (simply by opening the data URI).

Another thing to figure out here is the following: gfxImageSurface has a memory reporter, which is invoked under some circumstances. But DMD is seeing dark matter in some gfxImageSurface objects here, which means that the memory reporter is not being run for some gfxImageSurfaces when we do a DMD dump. Why is that?

If we could figure out why the memory reporters for these gfxImageSurface objects are not being run, that might help us understand why they're building up as they are. We have bug 820248 open on a similar problem, which may or may not be related.

I'm happy to keep working on this, but it would be a big help to me if you could step in, so let me know.
(In reply to Justin Lebar [:jlebar] from comment #41)
>
> I've been trying to debug this locally, but the camera app segfaults with a
> null-pointer exception on trunk. So it's somewhat slow-going...

It's probably bug 882328; there's a patch pending there.
Further to comment 45, occurrences of the different images in the memory-report:

# grep -c length=9114 memory-reports
160
# grep -c length=9117 memory-reports
2
# grep -c length=44737 memory-reports
2

(Although the data for each image appears twice in the memory-report, one occurrence uses the "length=X" notation, while the other uses "length-X" notation; so the above greps are unique.)

So the Marketplace app icon appears in the memory-report 160 times (or 162, considering that the 9117-byte images are also Marketplace app icons). The 44737-byte image is a Homescreen wallpaper (although, interestingly, _not_ the Homescreen wallpaper I have selected).
Cristian, it looks like the Homescreen might be leaking icon resources when switching tasks.
Flags: needinfo?(crdlc)
Hi,

My main concern here is to know whether this bug is reproducible when ev.me is loaded or not. Looking at our homescreen implementation, we revoke all icon resources after loading:

https://github.com/mozilla-b2g/gaia/blob/master/apps/homescreen/js/page.js#L233
https://github.com/mozilla-b2g/gaia/blob/master/apps/homescreen/js/page.js#L240
https://github.com/mozilla-b2g/gaia/blob/master/apps/homescreen/js/page.js#L351
https://github.com/mozilla-b2g/gaia/blob/master/apps/homescreen/js/page.js#L530

What do you mean by "switching tasks"? I don't know what the Homescreen wallpaper is; the homescreen doesn't define the wallpaper, it is defined in the system layer as far as I know. Could I take a look at some part of the home code?

Thanks a lot
Flags: needinfo?(crdlc) → needinfo?(mhabicher)
This memory leak isn't in the homescreen app; it's in the main process.
After killing the Homescreen app and rerunning get_about_memory.py, some of the images in the memory-report have been freed:

# grep -c length=9114 memory-reports
156
# grep -c length=9117 memory-reports
0
# grep -c length=44737 memory-reports
2
What if you run get_about_memory.py --minimize?
(If the images are garbage -- and I suspect they're not -- get_about_memory.py --minimize will dump them. If --minimize doesn't dump them, then they're definitely leaked somehow.)
(In reply to Cristian Rodriguez de la Cruz (:crdlc) from comment #48)
>
> My main concern here is to know whether this bug is reproducible when ev.me
> is loaded or not. Looking at our homescreen implementation, we revoke all
> icon resources after loading:

The test that shows this problem is as follows:

1. restart the b2g process
2. open the Camera app, wait 30 seconds
3. switch to the Gallery app, wait 30 seconds
4. switch to the Camera app, wait 30 seconds
5. go to step 3

Somewhere between 60 and 90 iterations, the phone runs out of memory and the Homescreen fails to load.

> What do you mean by "switching tasks"? I don't know what the Homescreen
> wallpaper is; the homescreen doesn't define the wallpaper, it is defined in
> the system layer as far as I know.

The about:memory report of the b2g parent process shows that it is holding onto several strings that contain "data:image/" URLs corresponding to the phone's wallpaper, and another 80 to 160 strings that contain the "data:image/" URL for the Marketplace icon. We're trying to figure out where these come from.

91.72 MB (100.0%) -- explicit
├──33.97 MB (37.03%) -- js-non-window
│  ├──27.98 MB (30.51%) -- compartments
│  │  ├──26.33 MB (28.70%) -- non-window-global
│  │  │  ├──24.10 MB (26.28%) -- compartment([System Principal])
│  │  │  │  ├───9.85 MB (10.74%) -- gc-heap
│  │  │  │  │  ├──4.17 MB (04.55%) ── unused-gc-things
│  │  │  │  │  ├──3.05 MB (03.32%) -- objects
│  │  │  │  │  │  ├──2.70 MB (02.94%) ── non-function
│  │  │  │  │  │  └──0.35 MB (00.38%) ── function
│  │  │  │  │  ├──1.78 MB (01.94%) ── strings
│  │  │  │  │  └──0.85 MB (00.93%) ++ (4 tiny)
│  │  │  │  ├───7.62 MB (08.31%) -- string-chars
│  │  │  │  │  ├──5.10 MB (05.56%) ── non-huge
│  │  │  │  │  └──2.52 MB (02.75%) ── huge/string(length=9114, "data:image//png;base64,iVBORw0KGgoAAAANSUhEUgAAADwAAAA8CAYAAAA6//NlyAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAA

Hmm, I see that the icon for the Marketplace is defined in $B2G/gaia/external-apps/marketplace.firefox.com/update.webapp:

"icons": {
  "64": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAADwAAAA8CAYAAAA6/NlyAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAA2hpVFh0WE1MOmNvbS5hZG9iZS54bXAAAAAAADw/eHBhY2tldCBiZWdpbj0i77u..."
},

According to |grep -rn -A 4 \"icons\" *| in $B2G/gaia/{apps, external-apps, external-dogfood-apps}, the Marketplace is the _only_ app that has its icon so-encoded.

I don't suppose you know off-hand how app icons get loaded?
Flags: needinfo?(mhabicher) → needinfo?(crdlc)
I've just added traces to the homescreen where we create object URLs and set the src attributes for apps, and I don't see anything special for the Marketplace app. Its behavior is the same as the rest of the apps: after flashing the device, it loads the default icon (rocket), tries to load the icon defined at build time (application-data) and finally, when the mozApps API returns the info, it loads the correct one if it is different (theoretically the same one loaded previously).

In other words, I don't see any way in the home's code where we could create 160 strings that contain the "data:image/" URL for the Marketplace icon. I'm going to investigate a bit more, but I don't think the problem is in the home, although I cannot say I am 100% sure :)
Flags: needinfo?(crdlc)
Like I said in comment 49, the leak here is in the main process, not the homescreen app. One needs to look in the system app (or something else that's loaded in the main process), not the homescreen app, to have any hope of finding out what's going on here.
Agreed--leaks in any other processes would have been cleaned up when OoM killed them. It looks like the only code that parses 'manifest.webapp' is in Webapps.jsm, which I'm guessing runs in the main process?
> It looks like the only code that parses 'manifest.webapp' is in Webapps.jsm, which I'm guessing runs > in the main process? I think so.
Whiteboard: [mozilla-triage][MemShrink] → [mozilla-triage][MemShrink:P2]
The Leo engineers have hit a problem that looks very similar to this one by using marionette as part of bug 886217. The Marketplace icon seems to be leaked in the main process together with 8 other ones if I'm reading the about:memory report correctly. See the dumps in attachment 767627 [details].
> See the dumps in attachment 767627 [details].
>
> Main process
> 316.42 MB (100.0%) -- explicit
> ├──267.15 MB (84.43%) -- js-non-window
> │  ├──264.43 MB (83.57%) -- compartments
> │  │  ├──262.45 MB (82.94%) -- non-window-global
> │  │  │  ├──259.83 MB (82.12%) -- compartment([System Principal])
> │  │  │  │  ├──199.83 MB (63.15%) -- string-chars
> │  │  │  │  │  ├──163.52 MB (51.68%) -- huge
> │  │  │  │  │  │  ├──105.12 MB (33.22%) ── string(length=92050, "data:image//png;base64,iVBORw0KG...") [598]
> │  │  │  │  │  │  ├───11.68 MB (03.69%) ── string(length=9114, "data:image//png;base64,iVBORw0KG...") [598]
> │  │  │  │  │  │  ├────9.34 MB (02.95%) ── string(length=7074, "data:image//png;base64,iVBORw0KG...") [598]
> │  │  │  │  │  │  ├────9.34 MB (02.95%) ── string(length=7286, "data:image//png;base64,iVBORw0KG...") [598]
> │  │  │  │  │  │  ├────9.34 MB (02.95%) ── string(length=8054, "data:image//png;base64,iVBORw0KG...") [598]
> │  │  │  │  │  │  ├────4.67 MB (01.48%) ── string(length=2182, "data:image//png;base64,iVBORw0KG...") [598]
> │  │  │  │  │  │  ├────4.67 MB (01.48%) ── string(length=2522, "data:image//png;base64,iVBORw0KG...") [598]
> │  │  │  │  │  │  ├────4.67 MB (01.48%) ── string(length=3062, "data:image//png;base64,iVBORw0KG...") [598]
> │  │  │  │  │  │  ├────4.67 MB (01.48%) ── string(length=3926, "data:image//png;base64,iVBORw0KG...") [598]
> │  │  │  │  │  │  └────0.01 MB (00.00%) ── string(length=2499, "bssid // frequency // signal leve...")

Ouch, that's horrible. Like you said, these appear to be different strings, as they all have different lengths.
(In reply to Justin Lebar [:jlebar] from comment #60) > > Ouch, that's horrible. Like you said, these appear to be different strings, > as they all have different lengths. If it's anything like what I saw in comment 45, the largest one is probably one of the wallpapers. Unfortunately, the amount of string we log by default is just enough to show the basic PNG header. This patch increases the amount of logged string so we can at least identify what's leaking.
> If it's anything like what I saw in comment 45, the largest one is probably one of the wallpapers. Indeed, but what's different here is that we have many copies of the big one, whereas earlier we had only a few.
(In reply to Justin Lebar [:jlebar] from comment #62) > > Indeed, but what's different here is that we have many copies of the big > one, whereas earlier we had only a few. I was thinking about that; my best guess (for now) is that leo has more memory to leak into, so their tests can run longer, allowing it to accumulate more copies.
That sounds plausible to me!
(In reply to Mike Habicher [:mikeh] from comment #63)
> I was thinking about that; my best guess (for now) is that leo has more
> memory to leak into, so their tests can run longer, allowing it to
> accumulate more copies.

Confirmed, they have more memory and since the test is run via marionette it can run for hours (IIRC they took a memory dump every 2 hours).

(In reply to Mike Habicher [:mikeh] from comment #61)
> Unfortunately, the amount of string we log by default is just enough to show
> the basic PNG header. This patch increases the amount of logged string so we
> can at least identify what's leaking.

I'll ask them if they can re-run their tests with your patch applied so we get better visibility.
This memory report was taken after increasing the logged string length to 8k. (Attachment #767906 [details] [diff])
Well, one of these is a twitter logo... Have you tried get_about_memory.py --minimize? Can we check that that doesn't make these strings go away?
Okay, I got tired of doing this by hand, so here's a script. Just stick it somewhere on your path, cd into the folder with the 'memory-reports' file, and run it. It will spit out one file for each unique image in the report.

Running it against the log in comment 66, I see:
- five icons that look like the sun
- three Twitter icons
- two Facebook icons
- two Wikipedia icons
- a Marketplace icon
- an icon that looks like the top of a gold frame, or something
- and a JPEG that doesn't contain enough non-header data to decode

I wonder if, like the Marketplace icon, all of these are defined in their application manifests as "data:image/" URLs.
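The attached script itself isn't inlined here, but a minimal Python sketch of the same idea -- scan the 'memory-reports' file for base64 data: URIs, de-duplicate them, and write each one out as an image file -- might look like the following (the regex and file naming are illustrative, not the attachment's actual code):

import base64
import hashlib
import re
import sys

# Rough pattern for data URIs as they appear in about:memory dumps; the
# reports sometimes escape '/' (e.g. "data:image//png"), so allow repeats.
DATA_URI_RE = re.compile(r'data:image/+(\w+);base64,([A-Za-z0-9+/=]+)')

def extract_images(report_path="memory-reports"):
    with open(report_path, encoding="utf-8", errors="replace") as f:
        text = f.read()

    seen = set()
    for ext, payload in DATA_URI_RE.findall(text):
        if payload in seen:
            continue
        seen.add(payload)
        try:
            data = base64.b64decode(payload + "=" * (-len(payload) % 4))
        except Exception:
            continue  # truncated strings won't decode cleanly
        name = "image-%s.%s" % (hashlib.sha1(data).hexdigest()[:8], ext)
        with open(name, "wb") as out:
            out.write(data)
        print("wrote", name, len(data), "bytes")

if __name__ == "__main__":
    extract_images(sys.argv[1] if len(sys.argv) > 1 else "memory-reports")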
All the apps that were added at the operator's request have their icon set in the manifest as "data:image/png;base64,iVBORw0K~~~~~~".

Is this a clue for this issue?
(In reply to jongsoo.oh from comment #69)
> All the apps that were added at the operator's request have their icon set
> in the manifest as "data:image/png;base64,iVBORw0K~~~~~~".
>
> Is this a clue for this issue?

Yes, it's a very important clue. So we have to check what the system app is doing with the application manifest, as Mike suggested in comment 56. Considering that switching apps seems to cause the issue, it might be happening somewhere in the window manager; a quick look over the code shows that it's using the app's icon, see here:

https://github.com/mozilla-b2g/gaia/blob/master/apps/system/js/window_manager.js#L1271

... and here:

https://github.com/mozilla-b2g/gaia/blob/master/apps/system/js/window_manager.js#L446

Though from a superficial look that code seems to run only at application startup and thus shouldn't be leaking the icon.
gsvelto, I see this happening on the v1.0.1 branch, and the code looks different. Can you suggest some places to look here?

https://github.com/mozilla-b2g/gaia/blob/v1.0.1/apps/system/js/window_manager.js

I was thinking of commenting out the equivalent of the lines you mention in comment 70 and seeing if the leak still occurs.
If we run this test with principal merging disabled, that will probably tell us which JSM/JS component the leak is coming from. The pref to flip is jsloader.reuseGlobal. Set it to false, e.g. in b2g/app/b2g.js. Then get a new memory report; instead of putting the strings under compartment([System Principal]), hopefully they'll be under something else.

> So we have to check what the system app is doing with the application manifest

The leak is probably in something from Gecko; that's why it's under System Principal and not under the system app in about:memory. Alternatively we can just look at whatever touches manifests...
Actually, we hardcoded jsloader.reuseGlobal to true in B2G. You need to set to false in mozJSComponentLoader.cpp. Search for "reuseGlobal" and look right below that.
Also, if you test this again, please check whether get_about_memory.py --minimize gets rid of the strings, after you observe that they're there.
(In reply to Justin Lebar [:jlebar] from comment #74)
>
> Also, if you test this again, please check whether get_about_memory.py
> --minimize gets rid of the strings, after you observe that they're there.

I've done that with reports in the past, and it didn't make any difference--at least, not to the data:image strings.
jlebar, here is the latest set of memory reports. I don't see anything obvious in them, but hopefully you can make some sense of them. They were obtained with:

diff --git a/b2g/app/b2g.js b/b2g/app/b2g.js
--- a/b2g/app/b2g.js
+++ b/b2g/app/b2g.js
@@ -678,17 +678,17 @@ pref("network.activity.blipIntervalMilli
 // By default we want the NetworkManager service to manage Gecko's offline
 // status for us according to the state of Wifi/cellular data connections.
 // In some environments, such as the emulator or hardware with other network
 // connectivity, this is not desireable, however, in which case this pref
 // can be flipped to false.
 pref("network.gonk.manage-offline-status", true);

-pref("jsloader.reuseGlobal", true);
+pref("jsloader.reuseGlobal", false);

 // Enable font inflation for browser tab content.
 pref("font.size.inflation.minTwips", 120);

 // And disable it for lingering master-process UI.
 pref("font.size.inflation.disabledInMasterProcess", true);

 // Enable freeing dirty pages when minimizing memory; this reduces memory
 // consumption when applications are sent to the background.
diff --git a/js/xpconnect/loader/mozJSComponentLoader.cpp b/js/xpconnect/loader/mozJSComponentLoader.cpp
--- a/js/xpconnect/loader/mozJSComponentLoader.cpp
+++ b/js/xpconnect/loader/mozJSComponentLoader.cpp
@@ -450,17 +450,17 @@ mozJSComponentLoader::ReallyInit()
 {
     nsresult rv;

     mReuseLoaderGlobal = Preferences::GetBool("jsloader.reuseGlobal");

     // XXXkhuey B2G child processes have some sort of preferences race that
     // results in getting the wrong value.
 #ifdef MOZ_B2G
-    mReuseLoaderGlobal = true;
+    // mReuseLoaderGlobal = true;
 #endif

     /*
      * Get the JSRuntime from the runtime svc, if possible.
      * We keep a reference around, because it's a Bad Thing if the runtime
      * service gets shut down before we're done. Bad!
      */
Attachment #769115 - Flags: feedback?(justin.lebar+bug)
(In reply to Mike Habicher [:mikeh] from comment #75)
> (In reply to Justin Lebar [:jlebar] from comment #74)
> >
> > Also, if you test this again, please check whether get_about_memory.py
> > --minimize gets rid of the strings, after you observe that they're there.
>
> I've done that with reports in the past, and it didn't make any
> difference--at least, not to the data:image strings.

Okay, great. Thanks!
> ├──28.82 MB (36.70%) -- js-non-window
> │  ├──21.95 MB (27.96%) -- compartments
> │  │  ├──20.43 MB (26.02%) -- non-window-global
> │  │  │  ├──11.19 MB (14.25%) ++ (111 tiny)
> │  │  │  ├───6.24 MB (07.95%) -- compartment([System Principal], resource://gre/modules/DOMRequestHelper.jsm)
> │  │  │  │  ├──2.40 MB (03.06%) -- gc-heap
> │  │  │  │  │  ├──1.30 MB (01.65%) ++ (5 tiny)
> │  │  │  │  │  └──1.10 MB (01.41%) ── unused-gc-things
> │  │  │  │  ├──2.25 MB (02.86%) -- string-chars
> │  │  │  │  │  ├──1.29 MB (01.64%) ── non-huge
> │  │  │  │  │  └──0.96 MB (01.22%) ── huge/string(length=9114, "data:image//png;base64,iVBORw0KGg) (***)
> │  │  │  │  ├──1.43 MB (01.82%) -- objects-extra
> │  │  │  │  │  ├──1.42 MB (01.81%) ── slots
> │  │  │  │  │  └──0.01 MB (00.01%) ── elements
> │  │  │  │  └──0.16 MB (00.21%) ++ (3 tiny)
> │  │  │  ├───2.02 MB (02.57%) -- compartment([System Principal], jar:file:///system/b2g/omni.ja!/components/Webapps.js)
> │  │  │  │  ├──0.70 MB (00.89%) ++ gc-heap
> │  │  │  │  ├──0.57 MB (00.73%) ── string-chars/non-huge (***)
> │  │  │  │  ├──0.47 MB (00.59%) ── objects-extra/slots
> │  │  │  │  ├──0.25 MB (00.32%) ── cross-compartment-wrappers
> │  │  │  │  ├──0.02 MB (00.02%) ── script-data
> │  │  │  │  └──0.02 MB (00.02%) ── other-sundries
> │  │  │  └───0.98 MB (01.25%) ++ compartment([System Principal], chrome://browser/content/shell.xul)
> │  │  └───1.52 MB (01.94%) -- no-global/compartment(atoms)
> │  │     ├──0.98 MB (01.24%) -- string-chars
> │  │     │  ├──0.89 MB (01.14%) ── non-huge
> │  │     │  └──0.09 MB (00.11%) ── huge/string(length=44737, "data:image//jpeg;base64,//9j//4AAQSkZJRgABAQEASAB)

This shows 0.96mb coming from presumably multiple copies of a length-9114 string in the DOMRequestHelper.jsm compartment. That's probably the culprit.

It also shows a lot of non-huge strings in Webapps.js, which may be relevant.
Maybe we're leaking DOMRequests from somewhere.
I may have figured this out. Let me give you a patch to look at.
Sorry, jlebar: this has your patch applied and the test still OOMs. memory-reports attached.
Can you point me to instructions for running this workload locally?
This leak is causing stability issues on the Leo device. Image data embedded in the manifest as shown below might be causing a memory leak:

"data:image/png;base64,iVBORw0K~~~~~~"

When we remove the apps described above, Leo gets better stability.
Have we established that the leak itself is caused by switching from one app to another? I did some quick tests last week but - strangely enough - couldn't reproduce the issue.

IMHO the first thing we should nail down is why (and where) we're reading all the app manifests; we'll probably be able to drill down to the leak from there. :jeffhwang confirmed that in their tests they had multiple apps installed with the icon being specified as a data URL in the manifest, and all of those were leaked. So we're obviously touching all the manifests, and even if there wasn't an actual leak I wouldn't understand why we're doing it.
> Sorry, jlebar: this has your patch applied and the test still OOMs. memory-reports attached.

Were you testing on b2g18? I discovered that, although the patch applies there, it has no effect on that branch.
(In reply to Gabriele Svelto [:gsvelto] from comment #85)
> IMHO the first thing we should nail down is
> why (and where) we're reading all the app manifests; we'll probably be able
> to drill down to the leak from there.

I agree with you that it shouldn't need to touch all the manifest files, but it seems it is doing that while we run Marionette tests. At the beginning of the test we used manifest files which had inline icons in them, and we got a crash on every device. After we removed the inline icons from the manifests and tested again, the result was remarkable: we have only had one crashed device out of 22 so far.

> we'll probably be able to drill down to the leak from there.

Gabriele, (if it is doing that) have you found out why Marionette is reading all the manifests so many times?
Flags: needinfo?(gsvelto)
(In reply to Justin Lebar [:jlebar] from comment #86)
>
> Were you testing on b2g18? I discovered that, although the patch applies
> there, it has no effect on that branch.

Yes, I was testing on b2g18. I can retest against m-c, assuming it's stable enough to run the stress test.

If you want to run the test yourself:

# git clone https://github.com/rwood-moz/gaia-ui-tests.git gaiastress
# cd gaiastress
# git pull origin gaiastress
# cd gaiatest
# cp testvars_template.json bug851626.json
-- edit bug851626.json to add |"acknowledged_risks": true,| to the top of the JSON object
# adb forward tcp:2828 tcp:2828
# gaiatest --type b2g --address localhost:2828 --testvars bug851626.json --restart --iterations=100 --checkpoint=10 tests/endurance/test_endurance_gallery_camera.py

--iterations: the number of times to repeat the gallery<-->camera switch test
--checkpoint: log the output of |adb shell b2g-ps| every this-number of iterations

(Those steps are reconstructed from my CLI history--ping me on IRC if you run into any issues, and I'll do my best to help you sort them out.)
BTW, gaiatest will warn you of this (and give you 30s to cancel) but I'll call it out here: THE STEPS IN COMMENT 88 WILL RESET THE DATA ON YOUR DEVICE, INCLUDING PICTURES ON THE uSD CARD.
Master is totally unusable, so I guess I'll try to backport this patch to b2g18. :-/
Okay, I got my device to work on master. I can reproduce the leak using marionette, but doing the same thing manually, I can't. So maybe this is another marionette leak.
(In reply to Justin Lebar [:jlebar] from comment #91)
> I can reproduce the leak using marionette, but doing the same thing
> manually, I can't. So maybe this is another marionette leak.

I agree, maybe this is another marionette leak. When I run the test using marionette, I can see the icons defined in manifest.webapp accumulating; but doing it manually, I can't see the same thing.
before marionette test.

│ │ │ │ │ ├──1.03 MB (02.83%) -- huge
│ │ │ │ │ │ ├──0.19 MB (00.52%) ── string(length=23510, "data:image//png;base64,iVBORw0KG...") [4]
│ │ │ │ │ │ ├──0.16 MB (00.43%) ── string(length=18514, "data:image//png;base64,iVBORw0KG...") [4]
│ │ │ │ │ │ ├──0.16 MB (00.43%) ── string(length=20438, "data:image//png;base64,iVBORw0KG...") [4]
│ │ │ │ │ │ ├──0.10 MB (00.27%) ── string(length=9114, "data:image//png;base64,iVBORw0KG...") [5]
│ │ │ │ │ │ ├──0.09 MB (00.26%) ── string(length=10914, "data:image//png;base64,iVBORw0KG...") [4]
│ │ │ │ │ │ ├──0.09 MB (00.26%) ── string(length=7074, "data:image//png;base64,iVBORw0KG...") [6]
│ │ │ │ │ │ ├──0.08 MB (00.21%) ── string(length=8054, "data:image//png;base64,iVBORw0KG...") [5]
│ │ │ │ │ │ ├──0.05 MB (00.13%) ── string(length=4654, "data:image//png;base64,iVBORw0KG...") [4]
│ │ │ │ │ │ ├──0.04 MB (00.11%) ── string(length=2522, "data:image//png;base64,iVBORw0KG...") [5]
│ │ │ │ │ │ ├──0.04 MB (00.11%) ── string(length=3062, "data:image//png;base64,iVBORw0KG...") [5]
│ │ │ │ │ │ └──0.04 MB (00.11%) ── string(length=3926, "data:image//png;base64,iVBORw0KG...") [5]

after testing by marionette.

│ │ │ │ │ ├──1.55 MB (03.72%) -- huge
│ │ │ │ │ │ ├──0.23 MB (00.56%) ── string(length=9114, "data:image//png;base64,iVBORw0KG...") [12]
│ │ │ │ │ │ ├──0.20 MB (00.49%) ── string(length=7074, "data:image//png;base64,iVBORw0KG...") [13]
│ │ │ │ │ │ ├──0.19 MB (00.45%) ── string(length=23510, "data:image//png;base64,iVBORw0KG...") [4]
│ │ │ │ │ │ ├──0.19 MB (00.45%) ── string(length=8054, "data:image//png;base64,iVBORw0KG...") [12]
│ │ │ │ │ │ ├──0.16 MB (00.38%) ── string(length=18514, "data:image//png;base64,iVBORw0KG...") [4]
│ │ │ │ │ │ ├──0.16 MB (00.38%) ── string(length=20438, "data:image//png;base64,iVBORw0KG...") [4]
│ │ │ │ │ │ ├──0.09 MB (00.23%) ── string(length=10914, "data:image//png;base64,iVBORw0KG...") [4]
│ │ │ │ │ │ ├──0.09 MB (00.23%) ── string(length=2522, "data:image//png;base64,iVBORw0KG...") [12]
│ │ │ │ │ │ ├──0.09 MB (00.23%) ── string(length=3062, "data:image//png;base64,iVBORw0KG...") [12]
│ │ │ │ │ │ ├──0.09 MB (00.23%) ── string(length=3926, "data:image//png;base64,iVBORw0KG...") [12]
│ │ │ │ │ │ └──0.05 MB (00.11%) ── string(length=4654, "data:image//png;base64,iVBORw0KG...") [4]
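A small helper makes comparing two dumps like these less error-prone: tally the trailing [N] copy counts per string length in each report and print the deltas. A rough Python sketch, assuming the entries look like the excerpts above and that each report has been saved to a text file (paths are placeholders):

import re
import sys
from collections import Counter

# Matches entries like:
#   0.23 MB (00.56%) ── string(length=9114, "data:image//png;base64,...") [12]
# where the trailing [N] is the number of identical copies of that string.
ENTRY_RE = re.compile(r'string\(length=(\d+),.*?\)\s*\[(\d+)\]')

def copy_counts(path):
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            for length, copies in ENTRY_RE.findall(line):
                counts[int(length)] += int(copies)
    return counts

def main(before_path, after_path):
    before, after = copy_counts(before_path), copy_counts(after_path)
    for length in sorted(set(before) | set(after)):
        delta = after[length] - before[length]
        if delta:
            print("length=%-6d %3d -> %3d  (%+d copies)"
                  % (length, before[length], after[length], delta))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])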
Running the gallery-camera stress test on an m-c/master build, the test actually completed; though the b2g parent process had ballooned to the point where only it and the Gallery app could fit in memory at the same time. After 100 iterations, I see 205 copies of the Marketplace app icon, ~2 per iteration. (jlebar, this is without your DOMRequest fixes--I'll run that test overnight.) I know next to nothing about how marionette works, but I wonder how it could be very specifically leaking data: URI icons.
Marionette might be causing the leak of the icons defined in the manifest as "data:image/png;base64,iVBORw0K~~~~~~".

Is Marionette handled by the QA team?
Flags: needinfo?(tchung)
The original issue was reported/found by a manual test (without marionette) though, correct?
(In reply to Rob Wood [:rwood] from comment #96)
> The original issue was reported/found by a manual test (without marionette)
> though, correct?

It's possible that the original issue manifests as a different leak than what we're seeing with marionette.
If you do

$ grep data:image/png gc-edges.762.1372815764.log | cut -f 4 -d ' ' | sort | uniq -c

you'll see that there are 28 copies each of two unique long png strings.
According to these GC logs, what's happening here is that we're leaking WebappsApplication objects. These objects each keep a ref to the app's manifest. The manifest keeps a ref to the icon. The icon string is not deduplicated. Therefore we leak an icon string for each WebappsApplication object we hold alive.
Per comment 96, please check whether the leak is reproducible when performing the test manually, to help narrow down the issue.
Flags: needinfo?(tchung)
I did, comment 91. See the dependent bugs here; we have a decent idea of what's going on.
Flags: needinfo?(gsvelto)
I think FFOS might have two different issues here. One is the marionette leak, where the inline icons are duplicated. The other is that the B2G process might have a leak (bug 889261). We have to keep them separate.
Indeed. You did the right thing by filing a separate bug; it's important to have one bug for each issue so we don't conflate them.
Whiteboard: [mozilla-triage][MemShrink:P2] → [mozilla-triage][MemShrink:P2] [TD-59414]
Target Milestone: --- → 1.1 QE4 (15jul)
(In reply to Justin Lebar [:jlebar] from comment #97)
> (In reply to Rob Wood [:rwood] from comment #96)
> > The original issue was reported/found by a manual test (without marionette)
> > though, correct?
>
> It's possible that the original issue manifests as a different leak than
> what we're seeing with marionette.

We thought bug 886217 was the same issue as bug 851626, but bug 886217 is a marionette issue, so I have reopened bug 886217. Let's handle the icon-duplication-in-marionette issue in bug 886217.
Attachment #769115 - Flags: feedback?(justin.lebar+bug)
(In reply to Mike Habicher [:mikeh] PTO until Aug 5 from comment #94)
> Running the gallery-camera stress test on an m-c/master build, the test
> actually completed; though the b2g parent process had ballooned to the point
> where only it and the Gallery app could fit in memory at the same time.
>
> After 100 iterations, I see 205 copies of the Marketplace app icon, ~2 per
> iteration.
>
> (jlebar, this is without your DOMRequest fixes--I'll run that test
> overnight.)
>
> I know next to nothing about how marionette works, but I wonder how it could
> be very specifically leaking data: URI icons.

Is there a copy of this stress test that I could use to see if this is potentially Marionette-related?
perfect, thanks
(In reply to Jonathan Griffin (:jgriffin) from comment #105)
> (In reply to Mike Habicher [:mikeh] PTO until Aug 5 from comment #94)
> > Running the gallery-camera stress test on an m-c/master build, the test
> > actually completed; though the b2g parent process had ballooned to the point
> > where only it and the Gallery app could fit in memory at the same time.
> >
> > I know next to nothing about how marionette works, but I wonder how it could
> > be very specifically leaking data: URI icons.
>
> Is there a copy of this stress test that I could use to see if this is
> potentially Marionette-related?

rwood pointed the test out to me. This is a pretty simple test, so I'm going to construct an orangutan version of it, which should help us determine whether the problem is related to Marionette or not. If it is, we then have to figure out whether any of the Gaia APIs that Marionette calls are involved.
I ran this test on mozilla-b2g18/v1-train on an inari (I don't have a leo). After 60 iterations, the gc-edges file shows hundreds of copies of the Marketplace icon and the icon for the HostStubTest app. I doubt this has anything to do with core Marionette, but it may involve the gaiatest atoms. I will write a version of the test which doesn't use them and see if this persists.
I made a simplified version of this test using pure Marionette, without gaiatest, and found that it does not leak application icons. So, the problem isn't in Marionette per se, but either in gaiatest, or in the Gaia APIs it uses. I'll narrow it down further.
Attached file test_camera.py
Pure Marionette version of camera/gallery test, without gaiatest
I've narrowed this down to one of the WebAPIs that gaiatest uses. Adding or removing this line of code (without using the return value anywhere) is enough to trigger or resolve the icon leak:

let appsReq = navigator.mozApps.mgmt.getAll();

I'll make as simple a test as I can manage to reproduce this, then file a separate bug.
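For context, a minimal way to exercise that call from a Marionette-driven Python test looks roughly like this; the package name and the frame/permission details are assumptions from memory of the b2g-era client (gaiatest normally runs this inside the System app frame, where navigator.mozApps.mgmt is available), and the JS line is the one quoted above:

from marionette import Marionette  # b2g-era package name; newer trees use marionette_driver

# Connect to the device after `adb forward tcp:2828 tcp:2828`.
m = Marionette(host='localhost', port=2828)
m.start_session()

# Repeatedly run the suspected call; each invocation appeared to pin another
# copy of every app's manifest (and inline icon) in the main process.
for i in range(100):
    m.execute_script("let appsReq = navigator.mozApps.mgmt.getAll();")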
> Adding or removing this line of code (without using the return value
> anywhere) is enough to trigger or resolve the icon leak:

Yeesh, that's really bad, if you don't have to use the return value.

Please cc Fabrice on the new bug and mark it as [MemShrink].
(In reply to Justin Lebar [:jlebar] from comment #113)
> > Adding or removing this line of code (without using the return value
> > anywhere) is enough to trigger or resolve the icon leak:
>
> Yeesh, that's really bad, if you don't have to use the return value.
>
> Please cc Fabrice on the new bug and mark it as [MemShrink].

Also - file the bug in Core --> DOM: Apps specifically.
Depends on: 897684
This isn't a 1.01 regression and has been stagnating for a while. Do we really need to block on this?
blocking-b2g: leo+ → leo?
I'm hopeful that bug 900221 will fix this. But that shouldn't have much bearing either way on the blocking status.
Agree with comment 115.
blocking-b2g: leo? → -
Rob, Jonathan: are we still seeing this issue with the endurance tests?
Flags: needinfo?(rwood)
Flags: needinfo?(jgriffin)
Unfortunately I was able to reproduce this crash three times today by running the gallery_camera endurance test on Inari with the latest master build. Each time b2g crashed before the 45th iteration of switching between the gallery and camera.
Flags: needinfo?(rwood)
Flags: needinfo?(jgriffin)
Okay, I wanted to see if this was reproducible manually, and it is--kind of.

After approximately 110 screen taps (or about 55 full camera-gallery cycles) the screen on the test Inari I borrowed from rwood went black--it looks like the backlight turned off as well. When I plugged in a USB cable to pull the logcat, I ran 'adb shell b2g-ps', which reported:

APPLICATION     USER     PID  PPID VSIZE  RSS   WCHAN    PC       NAME
b2g             root     111  1    173924 67024 ffffffff 4001b4e0 S /system/b2g/b2g
Usage           app_335  335  111  66312  20792 ffffffff 401384e0 S /system/b2g/plugin-container
Homescreen      app_343  343  111  68416  24296 ffffffff 4005c4e0 S /system/b2g/plugin-container
Camera          app_410  410  111  88596  32316 ffffffff 40113abc R /system/b2g/plugin-container
Gallery         app_435  435  111  78064  26908 ffffffff 401094e0 S /system/b2g/plugin-container
(Preallocated a root     452  111  63168  17184 ffffffff 4001b4e0 S /system/b2g/plugin-container

...all of the processes still active! None had crashed.

After some time, I noticed that the button backlight turned on. Pressing the power button once turned the button backlight off; pressing it again caused the lockscreen to come up properly! Unlocking the device took me back into the camera; pressing the gallery button caused the screen to go black again, as above.

Again, I was able to unlock the device, this time landing in the gallery. Hitting the camera button caused the screen to go black, and _this_ time the b2g parent process crashed.
blocking-b2g: - → koi?
I'm going to remove the dependency on bug 897684, since I can reproduce this issue manually.
Status: NEW → ASSIGNED
No longer depends on: 897684
With the following b2g18 build, I am unable to observe any memory leaks after 150 manually-triggered Camera<-->Gallery cycles:

- gecko: b2g18:3655fe17b75b
- gaia: 763757e133a4fa8b0cb49f35a8e6b6700c0bf345

==> BASELINE:

14:00:58 ➜ gaia adb shell b2g-ps
APPLICATION     USER      PID  PPID VSIZE  RSS   WCHAN    PC       NAME
b2g             root      965  1    212352 65400 ffffffff 4007a430 S /system/b2g/b2g
Homescreen      app_1033  1033 965  74708  28728 ffffffff 40102430 S /system/b2g/plugin-container
Camera          app_1080  1080 965  69828  25000 ffffffff 400bd430 S /system/b2g/plugin-container
Gallery         app_1113  1113 965  89748  29300 ffffffff 400c8430 S /system/b2g/plugin-container
(Preallocated a root      1140 965  63316  21308 ffffffff 40060430 S /system/b2g/plugin-container

==> AFTER 50 APP CYCLES (or 100 app switches):

14:36:48 ➜ btg030_hamachi-b2g18 git:(master) ✗ adb shell b2g-ps
APPLICATION     USER      PID  PPID VSIZE  RSS   WCHAN    PC       NAME
b2g             root      139  1    187332 59028 ffffffff 400c5430 S /system/b2g/b2g
Usage           app_365   365  139  66452  25348 ffffffff 4005d430 S /system/b2g/plugin-container
Homescreen      app_435   435  139  71640  29536 ffffffff 400bb430 S /system/b2g/plugin-container
Camera          app_502   502  139  77708  26452 ffffffff 40076430 S /system/b2g/plugin-container
Gallery         app_600   600  139  69584  26700 ffffffff 400ea430 S /system/b2g/plugin-container
(Preallocated a root      701  139  63312  21608 ffffffff 40055430 S /system/b2g/plugin-container

==> AFTER 150 APP CYCLES (or 300 app switches):

14:56:45 ➜ btg030_hamachi-b2g18 git:(master) ✗ adb shell b2g-ps
APPLICATION     USER      PID  PPID VSIZE  RSS   WCHAN    PC       NAME
b2g             root      139  1    187524 57740 ffffffff 400c5430 S /system/b2g/b2g
Usage           app_365   365  139  66452  23092 ffffffff 4005d430 S /system/b2g/plugin-container
Homescreen      app_435   435  139  71640  26768 ffffffff 400bb430 S /system/b2g/plugin-container
Camera          app_502   502  139  78776  25680 ffffffff 40076430 S /system/b2g/plugin-container
Gallery         app_600   600  139  69584  25140 ffffffff 400ea430 S /system/b2g/plugin-container
(Preallocated a root      701  139  63312  19232 ffffffff 40055430 S /system/b2g/plugin-container
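For anyone re-running this comparison, the snapshots above can also be captured automatically rather than copied by hand. A small sketch, assuming adb is on the PATH and that b2g-ps keeps the column layout shown above; app names containing spaces (like the preallocated process) will confuse the naive split:

  # Snapshot per-process RSS from `adb shell b2g-ps` and print the delta
  # between two points in time. Assumes adb is on the PATH and the column
  # layout shown above; names with spaces are parsed imperfectly.
  import subprocess
  import time

  def rss_snapshot():
      out = subprocess.check_output(['adb', 'shell', 'b2g-ps'])
      rss = {}
      for line in out.decode('utf-8', 'replace').splitlines()[1:]:
          cols = line.split()
          if len(cols) < 6:
              continue
          name, rss_kb = cols[0], cols[5]  # APPLICATION ... RSS ...
          if rss_kb.isdigit():
              rss[name] = int(rss_kb)
      return rss

  if __name__ == '__main__':
      baseline = rss_snapshot()
      time.sleep(60 * 30)  # e.g. while running ~50 manual cycles
      later = rss_snapshot()
      for name in sorted(baseline):
          print('%-16s %+d kB' % (name, later.get(name, 0) - baseline[name]))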
If 26 is affected, 27 is likely affected as well.
Based on previous discussions, moving this to koi+
blocking-b2g: koi? → koi+
Hema, any progress on this bug? It hasn't been commented on since 9/24.
Flags: needinfo?(hkoka)
Hoping to get some help from the perf team on this bug (mikeh is busy with the camera latency bugs).
Flags: needinfo?(mlee)
Kyle, is anyone on the MemShrink team able to help with this issue?
Flags: needinfo?(mlee) → needinfo?(khuey)
Priority: -- → P2
Target Milestone: 1.1 QE4 (15jul) → 1.2 C3(Oct25)
Someone else will be looking into this.
Assignee: mhabicher → nobody
Flags: needinfo?(hkoka)
Status: ASSIGNED → NEW
(In reply to Hema Koka [:hema] from comment #124)
> Based on previous discussions, moving this to koi+

Where did this discussion happen? I don't see why the rationale from comment 115 no longer applies.
Flags: needinfo?(hkoka)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #129)
> (In reply to Hema Koka [:hema] from comment #124)
> > Based on previous discussions, moving this to koi+
>
> Where did this discussion happen? I don't see why the rationale from
> comment 115 no longer applies.

Agreed. Moving to koi?, as we need to cut back on what we're blocking on for the release at this point anyway. We've shipped two releases with this bug already.
blocking-b2g: koi+ → koi?
Hema Koka deleted the linked story in Pivotal Tracker
Moving this out of koi? -- if we start seeing this frequently, we can renominate it (per comment 122).
blocking-b2g: koi? → ---
The MemShrink team doesn't have the bandwidth to look into non-koi+ bugs right now. If the situation changes here, we can reevaluate.
Flags: needinfo?(khuey)
Flags: needinfo?(hkoka)
mikeh: does this still reproduce, or can we close this?
Flags: needinfo?(mhabicher)
Rob, do we still see this endurance issue?
Flags: needinfo?(mhabicher) → needinfo?(rwood)
This test is now obsolete. In 1.4 and master/2.0 you can no longer switch back to the gallery from within the camera by a single button press. In 1.4/2.0 you need to take a photo, view the preview, click a menu, and then choose to switch to gallery from the preview. I will close this bug as wontfix. If/when I update the endurance test for 2.0 and this issue is seen again, I will open a new bug.
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(rwood)
Resolution: --- → WONTFIX