Closed Bug 889261 Opened 11 years ago Closed 11 years ago

[B2G][LEO] Never decrease memory of B2G after killed child process by LMK or by User.

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
blocker

Tracking

(blocking-b2g:leo+, firefox25 fixed, b2g18 fixed, b2g-v1.1hd fixed)

RESOLVED FIXED
1.1 QE5
blocking-b2g leo+
Tracking Status
firefox25 --- fixed
b2g18 --- fixed
b2g-v1.1hd --- fixed

People

(Reporter: jeremykimleo, Assigned: justin.lebar+bug)

References

Details

(Whiteboard: [MemShrink:P2])

Attachments

(11 files)

Leo's quality engineers are running a stability test. Currently, because of bug 886217, Leo removes the default icons of external (operator-provided) apps defined in "manifest.webapp". However, Leo has a crash problem caused by OOM. The crashes are not on the b2g side; they cause a kernel or modem panic.

Precondition:
1. Display > Screen timeout is set to "never".

STR:
1. Run a Marionette test to check stability.
2. The test order is: Call -> Browser -> Calendar -> Alarm -> Camera -> Contacts -> Email -> FM Radio -> Gallery -> Music -> after testing all apps, loop indefinitely for a long time.
3. The apps are never closed during the test; after each app is exercised, it goes to the background.

The purpose of this test is to check stability under OOM. The test triggers an OOM condition, and the LMK kills child processes in minfree order. The question is: why does the b2g process's memory usage never decrease after the LMK kills a child process? The b2g process's memory usage keeps growing throughout the test until it causes a kernel panic (I think this is what causes the fatal crash of the modem). When does b2g run GC (garbage collection)?
Report                           PID  Vss      Rss      Pss      Uss      cmdline
/memory-reports/oom/ 1/procrank  153  343596K  304264K  303942K  303744K  /system/b2g/b2g
/memory-reports/oom/ 5/procrank  153  353572K  316816K  313920K  311156K  /system/b2g/b2g
/memory-reports/oom/ 9/procrank  153  360796K  324040K  323478K  323004K  /system/b2g/b2g
/memory-reports/oom/21/procrank  153  364752K  327996K  327820K  327728K  /system/b2g/b2g
blocking-b2g: --- → leo?
Target Milestone: --- → 1.1 QE5
Summary: [B2G][LEO] When OOM, Never decrease memory of B2G after killed child process by LMK. → [B2G][LEO] When OOM, Never decrease memory of B2G after killed child process by LMK or by user.
Summary: [B2G][LEO] When OOM, Never decrease memory of B2G after killed child process by LMK or by user. → [B2G][LEO] Never decrease memory of B2G after killed child process by LMK or by User.
Whiteboard: [MemShrink]
We're leaking strings.

> 607.87 MB (100.0%) -- explicit
> ├──405.53 MB (66.71%) -- js-non-window
> │  ├──400.47 MB (65.88%) -- compartments
> │  │  ├──394.57 MB (64.91%) -- non-window-global
> │  │  │  ├──387.06 MB (63.67%) -- compartment([System Principal])
> │  │  │  │  ├──145.59 MB (23.95%) -- string-chars
> │  │  │  │  │  ├──145.57 MB (23.95%) ── non-huge [2]
> │  │  │  │  │  └────0.02 MB (00.00%) ── huge/string(length=3473, "bssid // frequency // signal level // flags //)
> │  │  │  │  ├──124.59 MB (20.50%) -- objects-extra
> │  │  │  │  │  ├──123.70 MB (20.35%) ── slots [2]
> │  │  │  │  │  └────0.89 MB (00.15%) ── elements [2]
> │  │  │  │  ├──113.07 MB (18.60%) -- gc-heap
> │  │  │  │  │  ├───56.73 MB (09.33%) -- objects
> │  │  │  │  │  │  ├──55.55 MB (09.14%) ── non-function [2]
> │  │  │  │  │  │  └───1.18 MB (00.19%) ── function [2]
> │  │  │  │  │  ├───42.89 MB (07.06%) ── strings [2]
> │  │  │  │  │  ├───11.06 MB (01.82%) ── unused-gc-things [2]
> │  │  │  │  │  └────2.40 MB (00.39%) ++ (4 tiny)
> │  │  │  │  └────3.82 MB (00.63%) ++ (6 tiny)
> │  │  │  └────7.51 MB (01.23%) ++ (11 tiny)
> │  │  └────5.91 MB (00.97%) ++ no-global
> │  └────5.06 MB (00.83%) ++ (2 tiny)
> ├──160.34 MB (26.38%) ── heap-unclassified

This could be the same as bug 851626, which also manifests as a string leak. There we observe leaking huge strings, but whether or not the strings are huge will depend on what the particular app manifests on your phone look like.

Otherwise we need GC/CC logs to figure this out. You can get the logs via tools/get_gc_cc_log.py. But make sure you update your B2G tree first (with git fetch && git merge; repo sync is not sufficient), otherwise the tool will not work.
> We're leaking strings.

Oh, that's in the main process.
Could you introduce for heap-unclassified? heap-unclassified seems to grow larger and larger during testing.
Flags: needinfo?(justin.lebar+bug)
> Could you introduce for heap-unclassified?

I'm not sure what you mean. If you mean, can you explain what heap-unclassified is, I can do better: If you load the memory report in Firefox Nightly (open about:memory and open the memory-report file) and then hover your mouse over any one of the lines, it will describe exactly what that entry records.
Flags: needinfo?(justin.lebar+bug)
But I could not see inside heap-unclassified. It appears to be a leaf entry with no further breakdown.
We are concerned that the b2g process's memory grows more and more as the test proceeds. It increases when child processes are killed by the LMK (and, I suspect, also when apps are killed by the user).

STR #1:
1) Launch the app at the first icon on the homescreen.
2) Wait 10 seconds for it to load.
3) Return to the homescreen with the Home key.
4) Repeat 1)-3) through the last icon on the homescreen (Leo has 23 apps on the homescreen, including operator apps).
5) Go back to step 1).

b2g memory after running STR #1 (in KB):

          Device #1                         | Device #2
          VSS     RSS     PSS     USS       | VSS     RSS     PSS     USS
 1 hour   103404   66648   58438   57856    | 104460   67704   59277   58548
 6 hours  127340   90580   84115   82804    | 126424   89668   82784   81944
12 hours  159888  123132  116760  115572    | 161564  124808  118482  117500
18 hours  221956  185200  179558  178256    | 215284  178528  172344  171032

The difference in memory between the start and the end (after 18 hours) is very large. I think that if the test runs longer, b2g's memory will keep increasing, as in comment 2. One test cycle runs 23 apps, so toward the end of each cycle the LMK has to kill a child process in order to launch the next app. I think the LMK killing child processes makes the b2g process consume more memory, or some garbage from the killed apps piles up in the b2g process. That is why we are paying attention to heap-unclassified memory.
Attached file dmd of b2g-process
We inspected the leak report using the DMD tool (attachment 772513 [details]). Through it we found some leaks caused by the Leo team, and those leaked items have been cleared.

=======================================================================
Unreported: ~3,114 blocks in stack trace record 1 of 1,035
 ~12,745,602 bytes (~12,745,602 requested / ~0 slop)
 11.14% of the heap (11.14% cumulative); 14.19% of unreported (14.19% cumulative)
 Allocated at
  malloc /home001/jinho.hwang/workSpace/B2G_Leo/B2G/gecko/memory/build/replace_malloc.c:152 (0x4018c2ae libmozglue.so+0x42ae)
  moz_xmalloc /home001/jinho.hwang/workSpace/B2G_Leo/B2G/gecko/memory/mozalloc/mozalloc.cpp:55 (0x410bd266 libxul.so+0xf19266)
  GonkGPSGeolocationProvider::NmeaCallback(long long, char const*, int) (0x42728cd2 lib_b2g_location.so+0x11cd2)
  (can't find lib) (0x42f3b28c libloc_eng.so+0x928c)
  (can't find lib)
  __thread_entry /home001/jinho.hwang/workSpace/B2G_Leo/B2G/bionic/libc/bionic/pthread.c:210 (0x400500ec libc.so+0x130ec)
  pthread_create /home001/jinho.hwang/workSpace/B2G_Leo/B2G/bionic/libc/bionic/pthread.c:352 (0x4004fc40 libc.so+0x12c40)
=======================================================================
Unreported: ~1,790 blocks in stack trace record 4 of 1,035
 ~7,326,470 bytes (~7,326,470 requested / ~0 slop)
 6.40% of the heap (37.42% cumulative); 8.15% of unreported (47.66% cumulative)
 Allocated at
  malloc /home001/jinho.hwang/workSpace/B2G_Leo/B2G/gecko/memory/build/replace_malloc.c:152 (0x4018c2ae libmozglue.so+0x42ae)
  moz_xmalloc /home001/jinho.hwang/workSpace/B2G_Leo/B2G/gecko/memory/mozalloc/mozalloc.cpp:55 (0x410bd266 libxul.so+0xf19266)
  GonkGPSGeolocationProvider::SvStatusCallback(GpsSvStatus*) (0x42728dd8 lib_b2g_location.so+0x11dd8)
  (can't find lib) (0x42a193fa gps.default.so+0x13fa)
  (can't find lib) (0x42f3b1b6 libloc_eng.so+0x91b6)
  (can't find lib)
  __thread_entry /home001/jinho.hwang/workSpace/B2G_Leo/B2G/bionic/libc/bionic/pthread.c:210 (0x400500ec libc.so+0x130ec)
  pthread_create /home001/jinho.hwang/workSpace/B2G_Leo/B2G/bionic/libc/bionic/pthread.c:352 (0x4004fc40 libc.so+0x12c40)
=======================================================================
Apart from the items in comment 9, these two records from attachment 772513 [details] account for about half (50%) of the total at this point. The messages may have been intended for some child process, but since that process was already killed by the LMK, are they leaked in b2g as garbage?

Unreported: ~2,782 blocks in stack trace record 2 of 1,035
 ~11,386,726 bytes (~11,386,726 requested / ~0 slop)
 9.95% of the heap (21.09% cumulative); 12.67% of unreported (26.86% cumulative)
 Allocated at
  realloc /B2G/gecko/memory/build/replace_malloc.c:192 (0x4018c1d0 libmozglue.so+0x41d0)
  moz_realloc /B2G/gecko/memory/mozalloc/mozalloc.cpp:97 (0x40e10a9e libxul.so+0xc6ca9e)
  Pickle::Resize(unsigned int) /B2G/gecko/ipc/chromium/src/base/pickle.cc:531 (0x40c3a3ae libxul.so+0xac23ae)
  Pickle::BeginWrite(unsigned int, unsigned int) /B2G/gecko/ipc/chromium/src/base/pickle.cc:420 (0x41634626 libxul.so+0xac2626)
  Pickle::WriteBytes(void const*, int, unsigned int) /B2G/gecko/ipc/chromium/src/base/pickle.cc:453 (0x41634662 libxul.so+0xac2662)
  void IPC::WriteParam<unsigned int>(IPC::Message*, unsigned int const&) /B2G/gecko/ipc/chromium/src/chrome/common/ipc_message_utils.h:75 (0x415088aa libxul.so+0x9968aa)
  IPC::ParamTraits<nsAString_internal>::Write(IPC::Message*, nsAString_internal const&) /B2G/objdir-gecko/dist/include/ipc/IPCMessageUtils.h:282 (0x415088dc libxul.so+0x9968dc)
  IPC::ParamTraits<int>::Write(IPC::Message*, int const&) /B2G_Leo/B2G/gecko/ipc/chromium/src/chrome/common/ipc_message_utils.h:134 (0x4152c578 libxul.so+0x9ba578)
  ~nsAString_internal /B2G/objdir-gecko/dist/include/nsTSubstring.h:85 (0x414e7906 libxul.so+0x975906)
  nsVoidArray::Count() const /B2G/objdir-gecko/dist/include/nsVoidArray.h:37 (0x40c2c06c libxul.so+0xa8806c)
  nsTHashtable<nsObserverList>::GetEntry(char const*) const /B2G/objdir-gecko/dist/include/nsTHashtable.h:148 (0x40c2c14c libxul.so+0xa8814c)
  ~nsAString_internal /B2G/objdir-gecko/dist/include/nsTSubstring.h:85 (0x4128c96c libxul.so+0x71a96c)
  mozilla::system::nsVolumeService::BroadcastVolume(nsAString_internal const&) /B2G/objdir-gecko/dist/include/nsError.h:1065 (0x4128c894 libxul.so+0x71a894)

Unreported: ~2,777 blocks in stack trace record 3 of 1,035
 ~11,366,261 bytes (~11,366,261 requested / ~0 slop)
 9.93% of the heap (31.02% cumulative); 12.65% of unreported (39.51% cumulative)
 Allocated at
  realloc /B2G/gecko/memory/build/replace_malloc.c:192 (0x4018c1d0 libmozglue.so+0x41d0)
  moz_realloc /B2G/gecko/memory/mozalloc/mozalloc.cpp:97 (0x40e10a9e libxul.so+0xc6ca9e)
  Pickle::Resize(unsigned int) /B2G/gecko/ipc/chromium/src/base/pickle.cc:531 (0x40c3a3ae libxul.so+0xac23ae)
  Pickle::BeginWrite(unsigned int, unsigned int) /B2G/gecko/ipc/chromium/src/base/pickle.cc:420 (0x41634626 libxul.so+0xac2626)
  Pickle::WriteBytes(void const*, int, unsigned int) /B2G/gecko/ipc/chromium/src/base/pickle.cc:453 (0x41634662 libxul.so+0xac2662)
  IPC::ParamTraits<nsAString_internal>::Write(IPC::Message*, nsAString_internal const&) /B2G/objdir-gecko/dist/include/ipc/IPCMessageUtils.h:276 (0x415088cc libxul.so+0x9968cc)
  IPC::ParamTraits<int>::Write(IPC::Message*, int const&) /B2G/gecko/ipc/chromium/src/chrome/common/ipc_message_utils.h:134 (0x4152c578 libxul.so+0x9ba578)
  ~nsAString_internal /B2G/objdir-gecko/dist/include/nsTSubstring.h:85 (0x414e7906 libxul.so+0x975906)
  nsVoidArray::Count() const /B2G/objdir-gecko/dist/include/nsVoidArray.h:37 (0x40c2c06c libxul.so+0xa8806c)
  nsTHashtable<nsObserverList>::GetEntry(char const*) const /B2G/objdir-gecko/dist/include/nsTHashtable.h:148 (0x40c2c14c libxul.so+0xa8814c)
  ~nsAString_internal /B2G/objdir-gecko/dist/include/nsTSubstring.h:85 (0x4128c96c libxul.so+0x71a96c)
  mozilla::system::nsVolumeService::BroadcastVolume(nsAString_internal const&) /B2G/objdir-gecko/dist/include/nsError.h:1065 (0x4128c894 libxul.so+0x71a894)
This issue decreases the available memory until, finally, the user has to restart the phone because no app can run.
blocking-b2g: leo? → leo+
If the destination process is killed by the LMK, the IPC messages seen in the comment 10 log might never be freed.
Like you said in bug 851626 comment 102, we should not conflate potentially separate issues. So even though you have two bugs which result in the same thing (the main process running out of memory), I'd appreciate it if you could file a separate bug for the second STR.

> we are find the some leak caused by leo team through this and those leaked items was
> cleared.

That's great! Is the relevant source code in a public repository? If so, could you please point us to the fix you made?

The pickle thing from comment 10 is known, but we have not had success figuring out what's going on, unfortunately. See bug 885158 (you can try it to see if it fixes your problem, but I'd guess not, based on our previous tests).
Jeremy, I've updated get_about_memory.py so that it pulls GC/CC logs. These logs should give us more information about the strings that we appear to be leaking here. You can get this new version of get_about_memory.py by running git fetch && git merge in your root B2G repository; repo sync is not sufficient.
Whiteboard: [MemShrink] → [MemShrink:P2]
>> That's great! Is the relevant source code in a public repository?
>> If so, could you please point us to the fix you made?

This was fixed in a file on the commercial RIL side; those files are not in a public repository.

We already think Leo has two problems:
- The first is bug 851626.
  - It is also related to bug 889990 - WebappsApplication objects should share memory.
- The second is this one.
  - It is the same as bug 880940 - ~~~~ b2g memory leak, homescreen can't launch.
  - After eliminating the encoded icons in the manifest, the b2g process's memory still increases.
  - Per the DMD report, we are paying attention to the growing Pickle memory on the b2g side.
Whiteboard: [MemShrink:P2] → [MemShrink]
Oh, sorry, that was a mistake.
Whiteboard: [MemShrink] → [MemShrink:P2]
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
I don't think this is necessarily a duplicate. In bug 880940, we did not see a large number of small strings; there, we saw large strings instead. It's quite possible that you have different manifests in this test, and that whatever icons you have here fall under the large-string threshold. But before dup'ing this bug, I'd like to make sure that's actually what's going on. That's why I keep asking for a GC/CC log. I would really appreciate it if you could collect one, so we could tell whether this bug is actually a dupe.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
We use the pre-installed apps and operator-requested apps. The duplicated-icon problem is not reproduced in manual testing; we hope the icon issue will be handled in bug 886217. This bug should address the leakage of IPC messages (Pickle) described in comment 12.
> This bug want to erase the leakage from IPC message(pickle) like Comment 12

With memory bugs, it's really important to distinguish between "overall" bugs and "specific" bugs. An overall bug says "when I do X, the phone runs out of memory." A specific bug says "we're leaking memory in this particular way; we should fix it."

This is an overall bug, not a specific bug. We should not change this bug into a specific bug.

The reason this is important is that fixing an overall bug often requires fixing many specific bugs. In the past we have tried what you are asking us to do in comment 19: We've said "this overall bug is actually about this other specific bug." But the problem we have run into is that we fix the specific issue, but it often does not fix the overall issue. Then the bug in bugzilla is in a weird state: Should the bugzilla bug be closed? Do we open a new bug for the remaining overall problem? It's very confusing.

This overall bug has many sub-problems. The pickle bug is one problem, but it is not the only one. Fixing the pickle issue will not fix this overall bug. So if we morph this bug into being about the pickle issue, we will be very confused. The more time we spend confused on bugs, the less time we will have to fix them.

Bug 880940 is also an overall bug; the pickle memory issue is also just one part of that bug. In contrast, bug 885158 was a specific bug that tried to fix the pickle issue. Unfortunately it did not succeed in fixing the overall bugs.
If you cannot collect a GC/CC log for us here, I may need to close this bug as incomplete. This bug is marked as blocking, but we cannot make progress without your assistance.
I attached the logs as you requested. In my case, after updating get_about_memory.py per your guide, it did not collect the logs correctly, so I triggered them manually:

$> echo -n "memory report" > /data/local/debug_info_trigger
$> echo -n "gc log" > /data/local/debug_info_trigger

These logs show that b2g cannot launch the homescreen.
Sorry, the CC log was missing, so I re-attached it.
Thanks. Unfortunately the memory report here looks completely different from the one from comment 0. All of the memory usage here is in heap-unclassified, whereas before it was largely in strings.

> 210.79 MB (100.0%) -- explicit
> ├──151.26 MB (71.76%) ── heap-unclassified

We'd need a DMD report in order to understand this; a GC/CC log isn't going to help, unfortunately. It's quite possible that the high heap-unclassified is mostly or entirely due to the pickle issue. Maybe you're no longer leaking strings after the fix to the RIL? That would be great, if so.
(In reply to Justin Lebar [:jlebar] from comment #20)

I'm sorry for my hasty comment 19. Please ignore it.
Could you use the previously attached DMD report in "dmd of b2g-process"? In that log, the GPS issue (comment 9) is fixed, but the Pickle issue (comment 10) still exists. We can upload a new DMD file tomorrow.
Yeah, until we get a new report to confirm it, we can assume that the remaining problem here is the pickle issue.
jongsoo.oh, in comment #13 jlebar suggested that you try a fix from bug 885158 to see if this pickle leak problem goes away. Are you using that fix in your builds or have you tried this patch?
Flags: needinfo?(zzongsoo)
Assignee: nobody → doug.turner
The bug 885158 patch is already applied. Some cases are fixed by the patch, but others still produce the Pickle leakage continuously.
Flags: needinfo?(zzongsoo)
Sometimes the channel is not closed when the process is killed by the LMK. In that case, messages keep stacking up in the channel, which means the Pickle memory leakage keeps growing. But we have not found why this case occurs.
Waiting on the DMD per comment #26. If there are any STR (even inconsistent STR) that show this problem outside of the test harness, please notify us.
Flags: needinfo?(zzongsoo)
+ includes the patch from bug 885158

STR #1:
1) Launch the app at the first icon on the homescreen.
2) Wait 10 seconds for it to load.
3) Return to the homescreen with the Home key.
4) Repeat 1)-3) through the last icon on the homescreen (Leo has 23 apps on the homescreen, including operator apps).
5) Go back to step 1).

Device 1 ran for 12 hours.
Same STR as comment #32. Device 2 ran for 12 hours.
We run the test with the Orangutan tool for the STR of comment #32. To make the LMK work, we installed more apps; the homescreen has 22 apps. The apps are launched continuously in a round-robin fashion. I attached the script file.

To run the Orangutan script:
1. Push the orng executable to /system/xbin and change its permissions:
   $> adb push orng /system/xbin
   $> adb shell chmod 755 /system/xbin/orng
2. Push the Orangutan script file to the sdcard:
   $> adb push runningApps.dat /sdcard/
3. Run the script iteratively:
   $> while true; do ./run.sh; sleep 10; done;
As we mentioned in comment #30, we suspect unclosed channels as the cause of the Pickle memory leakage. We are testing with the attached temporary patch (patch_V1.patch): the channel is removed if the number of messages in its queue exceeds 100. We consider exceeding 100 an abnormal case; this is just a check. With this patch, we think the leakage is smaller than in the original build.
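The idea behind patch_V1.patch can be sketched roughly as follows. This is an illustrative sketch only; the class and method names are invented and are not the actual Gecko IPC::Channel implementation.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Sketch of a bounded IPC output queue (illustrative, not real Gecko code):
// if the peer was killed by the LMK without the channel being Close()d,
// Send() would otherwise keep queueing pickled messages forever. Capping
// the queue bounds that growth by force-closing the stuck channel.
class ChannelSketch {
public:
    explicit ChannelSketch(std::size_t maxQueued)
        : mMaxQueued(maxQueued), mClosed(false) {}

    // Queue a message; force-close the channel if the backlog exceeds
    // the cap (treated as the abnormal "peer is gone" case).
    bool Send(const std::string& msg) {
        if (mClosed) {
            return false;
        }
        mOutputQueue.push_back(msg);
        if (mOutputQueue.size() > mMaxQueued) {
            Close();
            return false;
        }
        return true;
    }

    void Close() {
        mOutputQueue.clear();  // free every queued (otherwise leaked) message
        mClosed = true;
    }

    std::size_t PendingMessages() const { return mOutputQueue.size(); }
    bool IsClosed() const { return mClosed; }

private:
    std::deque<std::string> mOutputQueue;
    std::size_t mMaxQueued;
    bool mClosed;
};
```

A healthy channel never comes close to the cap, because the pipe drains its queue as soon as it is writable; only a channel whose peer is gone accumulates a backlog this large.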
Are you able to run the test for much longer with this patch applied? I'm working on instrumentation which will help us figure out where these leaked, unclosed channels are coming from. Stay tuned...
Attached image Graph
Attached file DMD dump 1
Attached file DMD dump 2
(In reply to Justin Lebar [:jlebar] from comment #36)
> Are you able to run the test for much longer with this patch applied?

Yes, I'm able to run much longer with the patch applied; see attachment 774486 [details]. With the patch applied, I did not see the unexpected sudden memory increase.

Before the patch, a lot of memory was consumed by messages on the output queue in the IPC channel. See attachment 774487 [details] and attachment 774488 [details]; these DMD dump files show memory leakage in the IPC channel.

In DMD dump 1, record 2 shows only 8.4 MB of blocks for messages:

Unreported: ~2,054 blocks in stack trace record 2 of 1,051
 ~8,407,022 bytes (~8,407,022 requested / ~0 slop)

After some time passed, we can see more than 32 MB of memory blocks in DMD dump 2.

So, I checked the number of messages in the output queue of every channel. Sometimes the size of the output queue grows beyond 800, which is unusual. It happened on some other channels too, but not on all of them. I think that if the channel does not get a Close() call after the LMK kills an application, b2g still tries to send messages to the channel. The pipe then stays in a waiting state, and b2g just keeps the messages in the output queue until the pipe becomes writable or the OnFileCanWriteWithoutBlocking() callback is called.

> I'm working on instrumentation which will help us figure out where these
> leaked, unclosed channels are coming from. Stay tuned...

If you find something, please let me know.
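The per-channel check described above could look roughly like this. The names are invented for illustration; this is not actual Gecko code.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Illustrative diagnostic sketch (invented names, not real Gecko code):
// given the output-queue length of every IPC channel, flag the ones whose
// backlog exceeds a threshold. A healthy pipe drains almost immediately
// once OnFileCanWriteWithoutBlocking() fires, so a large backlog suggests
// the peer was killed without the channel ever being Close()d.
std::vector<std::string> FindStuckChannels(
    const std::map<std::string, std::size_t>& queueSizes,
    std::size_t threshold) {
    std::vector<std::string> stuck;
    for (const auto& entry : queueSizes) {
        if (entry.second > threshold) {
            stuck.push_back(entry.first);  // candidate leaked channel
        }
    }
    return stuck;
}
```

For example, a channel with 800+ queued messages (as observed in the DMD dumps) would be flagged, while channels with a handful of in-flight messages would not.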
Depends on: 893012
Depends on: 893172
See bug 893172 for a patch you can try.
(In reply to Justin Lebar [:jlebar] from comment #41)
> See bug 893172 for a patch you can try.

Thank you. We will check it.
Flags: needinfo?(zzongsoo)
Depends on: 893865
Depends on: 893857
Assignee: doug.turner → justin.lebar+bug
> Created attachment 774429 [details]
> Orangutan (To Test, Leo team use this.)
>
> we do test with orangutan tools for STR of Comment #32
> For LMK work, we installed more app. The home has a 22 apps.

We're going to try to reproduce this locally, but it would be helpful to know exactly which apps you installed, and so on, insofar as that is possible.
(In reply to jongsoo.oh from comment #42)
> (In reply to Justin Lebar [:jlebar] from comment #41)
> > See bug 893172 for a patch you can try.
>
> Thank you. we will check it.

Jongsoo, did the patch in bug 893172 work for you? Also, I believe Justin is looking for more info on the apps you installed, per comment #43.
Flags: needinfo?(zzongsoo)
> Also I believe Justin is looking for more info on the apps you installed based on comment #43.

We've reproduced this locally and done a lot of testing over the past two weeks. We're not doing any more testing at the moment, and we're not currently looking into the specific apps Leo is using.
(In reply to bhavana bajaj [:bajaj] (on vacation until 08/14/2013) from comment #44)
> Jongsoo, Did the patch in 893172 work for you ? Also I believe Justin is
> looking for more info on the apps you installed based on comment #43.

We applied the patch from bug 893172 and tested for a long time. The memory no longer increases continuously. Sorry for the late response, and thank you.
Flags: needinfo?(zzongsoo)
Let's call this fixed, then!
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED