Closed Bug 780437 Opened 13 years ago Closed 13 years ago

Apparent child-process OOMs can kill the parent process, despite oom_adj scores which should preclude this

Categories

(Firefox OS Graveyard :: General, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-basecamp:+)

RESOLVED WORKSFORME
blocking-basecamp +

People

(Reporter: justin.lebar+bug, Assigned: justin.lebar+bug)

References

Details

(Whiteboard: [see comment 33])

Bug 768832 lets us set different oom_adj and nice values for the master process, foreground processes, and background processes. I didn't put any effort into choosing good values for these parameters; I just made them up. They're currently set to:
            oom_adj  nice
    master  0        -1
    fg      1        0
    bg      2        10
I'm not sure how to empirically or analytically choose optimal values here. Michael, do you have any thoughts?
Those oom values seem like decent guesses for now. I think this is probably something that could be shaken out during stability testing. The exact values for oom_adj probably don't matter much, though, so long as we start at zero and increase by at least one per level, as we can configure the LMK for a device as needed. From what I recall of the LMK defaults, the 0,1,2 values should work fairly well. The other daemons spawned by init run at -16 by default. I don't have anything to say about those nice values at the moment.
The lowmemkiller has memory thresholds for oom_adj classes that were chosen based on Android work. We should leverage that by making our oom_adj values take on the Android equivalents when possible. The values are configured in init.rc and interpreted by lowmemorykiller.c in the kernel.
I'm not sure if this is the right init.rc, but b2g/build/target/board/vbox_x86/init.rc has
    setprop ro.FOREGROUND_APP_ADJ 0
    setprop ro.VISIBLE_APP_ADJ 1
    setprop ro.PERCEPTIBLE_APP_ADJ 2
    setprop ro.HEAVY_WEIGHT_APP_ADJ 3
    setprop ro.SECONDARY_SERVER_ADJ 4
    setprop ro.BACKUP_APP_ADJ 5
    setprop ro.HOME_APP_ADJ 6
    setprop ro.HIDDEN_APP_MIN_ADJ 7
    setprop ro.EMPTY_APP_ADJ 15
Which suggests that our oom_adj values are reasonable.
not blocking, nice to have.
blocking-basecamp: --- → +
Did you mean to set it to blocking+, then?
what jlebar said.
blocking-basecamp: + → -
So we've switched over to oom_score_adj and the current settings are set to:
    pref("hal.processPriorityManager.gonk.masterOomScoreAdjust", 0);
    pref("hal.processPriorityManager.gonk.foregroundOomScoreAdjust", 67);
    pref("hal.processPriorityManager.gonk.backgroundOomScoreAdjust", 400);
I modified b2g-ps to print out the oom_score and oom_score_adj. In talking with Justin, he was concerned that the foreground process could get killed, and I believe that to be the case. I was able to reproduce exactly that scenario, although it took a while. I was sitting at the following:
    Application  OOM  ADJ  USER   PID   PPID  VSIZE   RSS    WCHAN     PC          NAME
    b2g          460  0    root   106   1     235012  91888  ffffffff  400ca330 S  /system/b2g/b2g
    Homescreen   249  67   app_0  7323  106   108848  34232  ffffffff  40040330 S  /system/b2g/plugin-container
    (App)        490  400  app_0  7350  106   55704   16928  ffffffff  400c9330 S  /system/b2g/plugin-container
and then launched the Maps app
    Application  OOM  ADJ  USER   PID   PPID  VSIZE   RSS    WCHAN     PC          NAME
    b2g          465  0    root   106   1     236036  92748  ffffffff  400c96ec R  /system/b2g/b2g
    Homescreen   239  67   app_0  7323  106   122312  32188  ffffffff  40040330 S  /system/b2g/plugin-container
    Maps         196  67   app_0  7350  106   62936   23952  ffffffff  410adee6 R  /system/b2g/plugin-container
The Maps app popped up a dialog asking if it was ok to use geolocation, and presumably right around the same time, the prelaunch app started. This killed the Maps app.
    Application  OOM  ADJ  USER   PID   PPID  VSIZE   RSS    WCHAN     PC          NAME
    b2g          465  0    root   106   1     237060  92840  ffffffff  400c96ec R  /system/b2g/b2g
    Homescreen   190  67   app_0  7323  106   122312  23084  ffffffff  40040330 S  /system/b2g/plugin-container
    (App)        490  400  app_0  7430  106   55704   16928  ffffffff  4002cd60 R  /system/b2g/plugin-container
Requesting blocking per comment 8.
blocking-basecamp: - → ?
Another problem with this is that apparently the homescreen never gets marked as background. (Its ADJ is always 67, even when the Maps app is showing.)
Actually, it does. I think here, Homescreen was in the foreground because the Maps app died. I've also noticed that when launching an app, the previous foreground and the new foreground apps both have oom_score_adj of 67 momentarily and then the old foreground eventually switches to 400. I'd expect that we should give the b2g app an oom_score_adj of -1000 (never kill), the foreground an oom_score_adj of 0, and the background an oom_score_adj of 500 or 600
QA Contact: dhylands
> I've also noticed that when launching an app, the previous foreground and the new foreground apps both have oom_score_adj of 67 momentarily and then the old foreground eventually switches to 400.
There's a grace period, which we can tweak. See dom/ipc/ProcessPriorityManager.cpp. It's a horrible hack. :(
Assignee: nobody → dhylands
QA Contact: dhylands
Whiteboard: [LOE:S]
Killing apps that you *just* launched is bad so let's get this fixed.
blocking-basecamp: ? → +
Blocks: 808517
Dave, do you still intend to work on this bug in the near future? This is important as we get more testers, since I expect to see more bugs like bug 808517, and I want to be able to rule out poor oom_adj values as the cause. If you're busy with other things, we can probably find someone else to do this.
Justin, I'll unassign myself. I got a bunch of new stuff at the work week.
Assignee: dhylands → nobody
Since this bug relates to the low-memory-killer thresholds (adj and minfree) under low memory, maybe we should take the values below into consideration. As far as I know, most partners tweak these values (according to platform and /proc/meminfo) for optimization (and to reduce the number of times the kernel OOM killer is invoked?).
    /sys/module/lowmemorykiller/parameters/adj
    /sys/module/lowmemorykiller/parameters/minfree
My reference phone with the b2g ROM has 0,1,6 for adj and 256,1024,2048 for minfree. With the Android shipping ROM it has 0,1,2,4,9,15 for adj and 3676,4971,6266,8314,9610,11448 for minfree.
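To make the adj/minfree pairing concrete, here is a minimal Python sketch of how such a threshold table is commonly read: each minfree entry (in pages) is paired with the adj entry at the same index, and the first threshold that free memory drops below sets the lowest oom_adj that becomes eligible for killing. This is a simplified model and not the kernel code. The function name `min_killable_adj` is hypothetical, and the sketch ignores file-cache pages, which the real driver is believed to consider as well.

```python
from typing import Optional, Sequence


def min_killable_adj(free_pages: int,
                     adj: Sequence[int],
                     minfree: Sequence[int]) -> Optional[int]:
    """Return the lowest oom_adj eligible for killing, or None if memory is fine.

    Simplified model (an assumption, not the kernel source): walk the
    thresholds in ascending order and stop at the first one that the
    amount of free memory has dropped below.
    """
    for threshold, level in zip(minfree, adj):
        if free_pages < threshold:
            return level
    return None


# Values quoted in the comment for the reference phone running the b2g ROM.
B2G_ADJ = [0, 1, 6]
B2G_MINFREE = [256, 1024, 2048]

if __name__ == "__main__":
    for free in (4096, 1500, 900, 100):
        print(free, min_killable_adj(free, B2G_ADJ, B2G_MINFREE))
```

Under this model, with the b2g values only processes at oom_adj 6 or higher are candidates until free memory drops below 1024 pages, which is one reason the thresholds and the per-process values need to be tuned together.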
> Considering this bug would be related to low-memory-killer threshold (adj and minfree) when low memory, maybe we should take below values in to consideration.
Yes, we also need to tweak those. We're already setting them in GonkHal.
Just a note to say that the b2g-ps with the --oom option was just merged into gonk-misc
I ran
    $ adb shell $(cat ps_hack)
where ps_hack is
    for x in /proc/*; do
      if [ -e $x/oom_adj ]; then
        echo "$x $(cat $x/oom_score_adj) $(cat $x/cmdline)"
      fi
    done
This showed that critical phone services run with oom_score_adj -941. These processes are:
> /init
> /system/bin/vold
> /system/bin/fakeperm
> /system/bin/rilproxy
> /system/bin/netd
> /system/bin/rild
> /system/bin/drmserver
> /system/bin/mediaserver
> /system/bin/dbus-daemon--system--nofork
> /system/bin/installd
> /system/bin/keystore/data/misc/keystore
> /system/bin/akmd8962
> /sbin/adbd
> /system/bin/sh
> /system/bin/qmuxd
> /system/bin/netmgrd
> /system/bin/ATFWD-daemon
> /system/bin/wpa_supplicant
> /system/bin/debuggerd
> /sbin/ueventd
> /system/bin/servicemanager
There are a bunch of other system processes running with oom_score_adj 0:
> root 2 0 0 0 c00cb25c 00000000 S kthreadd
> root 3 2 0 0 c00b73f8 00000000 S ksoftirqd/0
> root 6 2 0 0 c00c65a8 00000000 S khelper
> root 7 2 0 0 c00c65a8 00000000 S suspend_sys_syn
> root 8 2 0 0 c00c65a8 00000000 S suspend
> root 9 2 0 0 c0113074 00000000 S sync_supers
> root 10 2 0 0 c0113d6c 00000000 S bdi-default
> root 11 2 0 0 c00c65a8 00000000 S kblockd
> root 12 2 0 0 c02eff78 00000000 S khubd
> root 13 2 0 0 c00c65a8 00000000 S l2cap
> root 14 2 0 0 c00c65a8 00000000 S a2mp
> root 15 2 0 0 c00c65a8 00000000 S modem_notifier
> root 16 2 0 0 c00c65a8 00000000 S smd_channel_clo
> root 19 2 0 0 c00c65a8 00000000 S rpcrouter
> root 20 2 0 0 c00c65a8 00000000 S rpcrotuer_smd_x
> root 21 2 0 0 c00627b0 00000000 S krpcserversd
> root 23 2 0 0 c0061714 00000000 S kadspd
> root 24 2 0 0 c00614a8 00000000 D voicememo_rpc
> root 25 2 0 0 c010df90 00000000 S kswapd0
> root 26 2 0 0 c015b184 00000000 S fsnotify_mark
> root 27 2 0 0 c00c65a8 00000000 S crypto
> root 40 2 0 0 c00c65a8 00000000 S mdp_dma_wq
> root 41 2 0 0 c00c65a8 00000000 S mdp_vsync_wq
> root 42 2 0 0 c00c65a8 00000000 S mdp_hist_wq
> root 43 2 0 0 c00c65a8 00000000 S mdp_pipe_ctrl_w
> root 45 2 0 0 c00c65a8 00000000 S kgsl-3d0
> root 46 2 0 0 c02c4d38 00000000 S mtdblock0
> root 47 2 0 0 c02c4d38 00000000 S mtdblock1
> root 48 2 0 0 c02c4d38 00000000 S mtdblock2
> root 49 2 0 0 c02c4d38 00000000 S mtdblock3
> root 50 2 0 0 c02c4d38 00000000 S mtdblock4
> root 51 2 0 0 c02c4d38 00000000 S mtdblock5
> root 52 2 0 0 c02c4d38 00000000 S mtdblock6
> root 53 2 0 0 c02c4d38 00000000 S mtdblock7
> root 54 2 0 0 c02c4d38 00000000 S mtdblock8
> root 55 2 0 0 c02c4d38 00000000 S mtdblock9
> root 56 2 0 0 c02c4d38 00000000 S mtdblock10
> root 57 2 0 0 c02c4d38 00000000 S mtdblock11
> root 58 2 0 0 c02c4d38 00000000 S mtdblock12
> root 59 2 0 0 c02c4d38 00000000 S mtdblock13
> root 60 2 0 0 c02c4d38 00000000 S mtdblock14
> root 61 2 0 0 c02c4d38 00000000 S mtdblock15
> root 62 2 0 0 c02c4d38 00000000 S mtdblock16
> root 63 2 0 0 c02c4d38 00000000 S mtdblock17
> root 70 2 0 0 c00c65a8 00000000 S k_rmnet_mux_wor
> root 71 2 0 0 c00c65a8 00000000 S f_mtp
> root 72 2 0 0 c0331000 00000000 S file-storage
> root 74 2 0 0 c00c65a8 00000000 S diag_wq
> root 75 2 0 0 c00c65a8 00000000 S diag_cntl_wq
> root 76 2 0 0 c00c65a8 00000000 S atmel_wq
> root 78 2 0 0 c00614a8 00000000 D krtcclntd
> root 79 2 0 0 c0063dd8 00000000 D krtcclntcbd
> root 80 2 0 0 c00c65a8 00000000 S kfmradio
> root 81 2 0 0 c00614a8 00000000 D kbatteryclntd
> root 82 2 0 0 c0063dd8 00000000 D kbatteryclntcbd
> root 83 2 0 0 c00c65a8 00000000 S iewq
> root 84 2 0 0 c00cb36c 00000000 D kinteractiveup
> root 85 2 0 0 c00c65a8 00000000 S mmcsdcc_host1
> root 86 2 0 0 c00c65a8 00000000 S mmcsdcc_host2
> root 87 2 0 0 c00c65a8 00000000 S binder
> root 88 2 0 0 c051d05c 00000000 S krfcommd
> root 89 2 0 0 c00614a8 00000000 D khsclntd
> root 92 2 0 0 c03b7678 00000000 S mmcqd/0
> root 93 2 0 0 c01e8cdc 00000000 S yaffs-bg-1
> root 95 2 0 0 c01e8cdc 00000000 S yaffs-bg-1
> root 96 2 0 0 c01e8cdc 00000000 S yaffs-bg-1
> root 97 2 0 0 c01e8cdc 00000000 S yaffs-bg-1
> root 164 2 0 0 c00c65a8 00000000 S k_gserial
> root 165 2 0 0 c00c65a8 00000000 S k_gsmd
> root 357 2 0 0 c00c65a8 00000000 S cfg80211
> root 365 2 0 0 c00c65a8 00000000 S ath6kl
> root 495 2 0 0 c00614a8 00000000 D audmgr_rpc
> root 3631 2 0 0 c014a874 00000000 S flush-31:6
> root 19354 2 0 0 c00c7c08 00000000 S kworker/u:2
> root 19452 2 0 0 c005f160 00000000 D kworker/u:1
> root 21027 2 0 0 c00c7c08 00000000 S kworker/0:1
> root 31382 2 0 0 c00c7c08 00000000 S kworker/0:3
> root 31387 2 0 0 c00c7c08 00000000 S kworker/u:0
> root 31393 2 0 0 c03b4a78 00000000 S ksdioirqd/mmc1
My suspicion is that we don't want to profligately kill these system processes. If that's true, the lowest oom_score_adj we should use for any Gecko process is 0. So I'm going to try setting the [master, fg, bg] oom_score_adj values to [0, 500, 1000] and see if that fixes bug 808517. Suggestions are welcome, since I don't claim that I know what I'm doing here.
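For anyone who wants to repeat this survey, the same idea can be sketched in Python. This is an illustrative equivalent of the shell loop in the comment, not the script that was actually run. The function name `read_oom_score_adj` and its `proc_root` parameter are hypothetical; the parameter exists so the code can be pointed at a copy of /proc pulled off the device.

```python
import os
from typing import List, Tuple


def read_oom_score_adj(proc_root: str = "/proc") -> List[Tuple[int, int, str]]:
    """Return (pid, oom_score_adj, cmdline) for each process under proc_root."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score_adj")) as f:
                adj = int(f.read().strip())
            with open(os.path.join(base, "cmdline"), "rb") as f:
                raw = f.read()
        except (OSError, ValueError):
            continue  # process exited, or the file was unreadable
        # Arguments in cmdline are NUL-separated; join them with spaces.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        rows.append((int(entry), adj, cmdline))
    return sorted(rows)


if __name__ == "__main__":
    for pid, adj, cmdline in read_oom_score_adj():
        # Kernel threads have an empty cmdline.
        print(pid, adj, cmdline or "[kernel thread]")
```

Joining the arguments with spaces avoids the run-together names seen in the list, such as dbus-daemon--system--nofork, which come from printing the NUL-separated cmdline as-is.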
If you use an oom_score_adj of 1000 then there will be no preference amongst the background processes, since the max oom_score is 1000. You may want to look at something like 700 or 800 for the background process so that you get some variation based on the amount of memory that they're using. I've never seen the b2g process get killed in favor of the others, so picking a fairly low master adjustment seems like a good thing as well (based on measurements I've taken, I wouldn't go much above 100 for foreground). With a recent build, I noticed that Messages and Cost Control both seem to retain their "foreground" priority which makes large memory apps like maps more likely to be killed, even when in the foreground
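The saturation point can be illustrated with a rough Python model. It assumes that oom_score is approximately the process's share of memory scaled to 0..1000, plus oom_score_adj, clamped to the 0..1000 range. That is a simplification of the kernel heuristic, the function name `approx_oom_score` is hypothetical, and the memory shares in the example are made up.

```python
def approx_oom_score(mem_fraction: float, oom_score_adj: int) -> int:
    """Rough model: memory share scaled to 0..1000, plus the adjustment,
    clamped to the 0..1000 range. An assumption for illustration only."""
    raw = round(mem_fraction * 1000) + oom_score_adj
    return max(0, min(1000, raw))


if __name__ == "__main__":
    # Two hypothetical background apps using 5% and 20% of memory.
    for adj in (1000, 700):
        small = approx_oom_score(0.05, adj)
        large = approx_oom_score(0.20, adj)
        print(adj, small, large)
```

In this model an adjustment of 1000 pins both apps at the maximum score, so memory use no longer distinguishes them, while 700 leaves room for the larger app to score higher.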
> I've never seen the b2g process get killed in favor of the others
Bug 808517 sounds like that's what's happening.
> Bug 808517 sounds like that's what's happening.
To be clear, bug 808517 and its ilk is the only reason I'm looking at this bug at all.
> With a recent build, I noticed that Messages and Cost Control both seem to retain their "foreground" priority which makes large memory apps like maps more likely to be killed, even when in the foreground
cjones tells me we will not be running these apps in the background as we currently do, so I've been ignoring them.
> You may want to look at something like 700 or 800 for the background process so that you get some variation based on the amount of memory that they're using.
I see; thanks.
(In reply to Justin Lebar [:jlebar] from comment #22)
> > Bug 808517 sounds like that's what's happening.
> To be clear, bug 808517 and its ilk is the only reason I'm looking at this bug at all.
> > With a recent build, I noticed that Messages and Cost Control both seem to retain their "foreground" priority which makes large memory apps like maps more likely to be killed, even when in the foreground
> cjones tells me we will not be running these apps in the background as we currently do, so I've been ignoring them.
Messages is tracked in 801351. And Cost Control is being worked on. (Salva, is there already a bug number to remove it?)
> > You may want to look at something like 700 or 800 for the background process so that you get some variation based on the amount of memory that they're using.
> I see; thanks.
I can't test this right now because my phone's wifi isn't working. I'll try to get some help, but if nobody has any ideas, I'll put this off and see if tip is more functional in a few days.
Blocks: 806641
Blocks: 804498
Giving to jlebar to reassign
Assignee: nobody → justin.lebar+bug
(In reply to Dave Hylands [:dhylands] from comment #25)
> Giving to jlebar to reassign
I thought we agreed on IRC that the first person done with their current work would grab this? Or did I misunderstand?
I thought we were talking about the bug where the main process was OOMing. Probably related.
Summary: Tweak oom_adj and nice values for master/foreground/background processes → Tweak oom_adj and nice values for master/foreground/background processes (particularly to ensure that the parent process doesn't die on OOM, if that's actually happening)
Hey all, what needs to happen here? Is this still a release-blocking piece of work?
I or someone else needs to look through the bugs which this blocks and figure out if and why we're killing the main process. Those bugs are blockers, so inasmuch as this is the suspected fix, this is a blocker. I'd rather not un-nom this bug, because that just means that landing an eventual fix here will take many days longer.
Priority: -- → P1
This sounds like a potentially long-running bug because of the amount of investigation needed. Justin, if you can't take this on almost immediately, we should find someone else and get them on it. Also, what is the target milestone for this issue?
> Justin, if you can't take this on almost immediately, we should find someone else and get them on it.
I'd love some assistance; I'm still busy with works in progress.
Whiteboard: [LOE:S]
After reviewing the lowmem killer code, the algorithm basically boils down to the following:
    1 - Find the process with the largest oom_adj value.
    2 - Amongst the processes which share the largest oom_adj value, pick the one with the largest RSS.
So, from this we can conclude that the main process will never be selected over a content process (this agrees with all of my observations). The actual value of the oom_adj doesn't matter that much; it really just acts as a sort of "group control". So all that really matters is the relative order of the oom_adj values and not their absolute values. From this, I think we can conclude that no change is actually required. We may wish to add one or more extra categories if we decide we want to impose some ordering amongst, say, foreground processes (i.e. make dialer have a lower oom_adj than other foreground). The only other change I can think of is that for an app which is transitioning from foreground to background, we currently defer the change of oom_adj for a time period. We should probably change it immediately to some intermediate value between foreground and background, and change it to background after a timeout. That way, if you bring a large app to foreground, if it needs to, it will kill the app that was just swapped out. With the current scheme, if the need to kill an app occurs during the window where the outgoing and incoming apps both have foreground, it will be more likely to pick the incoming app if it has a larger RSS than the outgoing app.
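The two-step selection described in this comment can be written down as a short Python sketch. It models only what the comment states (largest oom_adj first, then largest RSS) and leaves out the free-memory thresholds that decide when the killer runs at all. The `Proc` class, the `pick_victim` function, and the process figures are hypothetical and only loosely based on the b2g-ps output earlier in the bug.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Proc:
    name: str
    oom_adj: int
    rss: int  # resident set size, in whatever unit the process listing reports


def pick_victim(procs: List[Proc]) -> Optional[Proc]:
    """Pick the process with the largest oom_adj, breaking ties by largest RSS."""
    if not procs:
        return None
    return max(procs, key=lambda p: (p.oom_adj, p.rss))


if __name__ == "__main__":
    # Hypothetical snapshot during the grace period: both apps still carry
    # the foreground value, so the larger incoming app is chosen.
    during_grace = [
        Proc("b2g", 0, 92000),
        Proc("outgoing app", 1, 23000),
        Proc("incoming app", 1, 34000),
    ]
    print(pick_victim(during_grace).name)

    # With the outgoing app moved to an intermediate value right away,
    # it is chosen instead of the app the user just launched.
    with_intermediate = [
        Proc("b2g", 0, 92000),
        Proc("outgoing app", 2, 23000),
        Proc("incoming app", 1, 34000),
    ]
    print(pick_victim(with_intermediate).name)
```

In both snapshots the main process is never picked, because its oom_adj is the lowest in the list regardless of its RSS, which matches the conclusion that only the relative order of the values matters.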
> From this, I think we can conclude that no change is actually required.
I'd been using this bug to track the dependencies, which are cases when the main process dies due to what appears to be OOM caused by the child process. I've looked at the kernel code and concur with your analysis. But /something/ is going on here. At the very least, perhaps all these dependent bugs aren't OOMs. Or perhaps the kernel isn't quick enough about killing the child process and kills the main process too, in an OOM situation. I'll update the bug summary to more accurately reflect what I've been using it to track.
Summary: Tweak oom_adj and nice values for master/foreground/background processes (particularly to ensure that the parent process doesn't die on OOM, if that's actually happening) → Apparent child-process OOMs can kill the parent process, despite oom_adj scores which should preclude this
Whiteboard: [see comment 33]
So, if the lowmemory killer is killing the main process, it should be leaving some evidence in the dmesg log (the "select" and "send sigkill" messages). If those aren't present, then it's dying for some other reason.
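A quick way to check a saved log for that evidence is sketched below in Python. The two marker strings come from the comment above; the exact message format printed by a given kernel is an assumption and may differ. The function name `find_lmk_evidence` is hypothetical, and a bare "select" can match unrelated lines, so hits still need a manual look.

```python
from typing import Iterable, List

# Substrings taken from the comment; the exact kernel message format
# is an assumption and may vary between kernel versions.
LMK_MARKERS = ("select", "send sigkill")


def find_lmk_evidence(lines: Iterable[str]) -> List[str]:
    """Return log lines that look like low-memory-killer activity."""
    return [line.rstrip("\n") for line in lines
            if any(marker in line.lower() for marker in LMK_MARKERS)]


if __name__ == "__main__":
    import sys

    # Usage: python find_lmk.py dmesg.txt
    with open(sys.argv[1], errors="replace") as log:
        for hit in find_lmk_evidence(log):
            print(hit)
```

The log file could be captured with something like `adb shell dmesg > dmesg.txt` shortly after the crash, before the kernel ring buffer wraps.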
No longer blocks: 806641
All of the dependencies here are resolved as either not OOMs or not reproducible.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME