Bug 780437 (Closed)
Opened 13 years ago; closed 13 years ago
Apparent child-process OOMs can kill the parent process, despite oom_adj scores which should preclude this
Categories: Firefox OS Graveyard :: General, defect, P1
Tracking: blocking-basecamp: +
Status: RESOLVED WORKSFORME
People: Reporter: justin.lebar+bug; Assigned: justin.lebar+bug
Whiteboard: [see comment 33]
Bug 768832 lets us set different oom_adj and nice values for the master process, foreground processes, and background processes.
I didn't put any effort into choosing good values for these parameters; I just made them up. They're currently set to:
        oom_adj  nice
master        0    -1
fg            1     0
bg            2    10
I'm not sure how to empirically or analytically choose optimal values here. Michael, do you have any thoughts?
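(For experimenting with these, the same knobs can be poked by hand from a root shell on the device; the sketch below is illustrative only, with a placeholder PID, and is not how Gecko applies the values.)
PID=1234   # placeholder: PID of the process whose priority we want to change
# oom_adj: larger values make the process a more attractive victim for the kernel's OOM killer.
echo 2 > /proc/$PID/oom_adj
# nice: larger values lower the process's CPU scheduling priority.
renice -n 10 -p $PID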
Assignee
Updated•13 years ago
Blocks: b2g-e10s-work
Comment 1•13 years ago
Those oom values seem like decent guesses for now; this is probably something that can be shaken out during stability testing. The exact values for oom_adj probably don't matter much so long as we start at zero and increase by at least one per level, since we can configure the LMK for a device as needed. From what I recall of the LMK defaults, the 0, 1, 2 values should work fairly well. The other daemons spawned by init run at -16 by default. I don't have anything to say about those nice values at the moment.
The lowmemkiller has memory thresholds for oom_adj classes that were chosen as part of the Android work. We should leverage that by making our oom_adj values take on the Android equivalents when possible. The values are configured in init.rc and interpreted by lowmemorykiller.c in the kernel.
Assignee
Comment 3•13 years ago
I'm not sure if this is the right init.rc, but b2g/build/target/board/vbox_x86/init.rc has
setprop ro.FOREGROUND_APP_ADJ 0
setprop ro.VISIBLE_APP_ADJ 1
setprop ro.PERCEPTIBLE_APP_ADJ 2
setprop ro.HEAVY_WEIGHT_APP_ADJ 3
setprop ro.SECONDARY_SERVER_ADJ 4
setprop ro.BACKUP_APP_ADJ 5
setprop ro.HOME_APP_ADJ 6
setprop ro.HIDDEN_APP_MIN_ADJ 7
setprop ro.EMPTY_APP_ADJ 15
This suggests that our oom_adj values are reasonable.
Assignee
Comment 6•13 years ago
Did you mean to set it to blocking+, then?
Comment 8•13 years ago
So we've switched over to oom_score_adj, and the current settings are:
pref("hal.processPriorityManager.gonk.masterOomScoreAdjust", 0);
pref("hal.processPriorityManager.gonk.foregroundOomScoreAdjust", 67);
pref("hal.processPriorityManager.gonk.backgroundOomScoreAdjust", 400);
I modified b2g-ps to print out the oom_score and oom_score_adj. In talking with Justin, I learned he was concerned that the foreground process could get killed, and I believe that to be the case.
I was able to reproduce exactly that scenario, although it took a while.
I was sitting at the following:
Application OOM ADJ USER PID PPID VSIZE RSS WCHAN PC NAME
b2g 460 0 root 106 1 235012 91888 ffffffff 400ca330 S /system/b2g/b2g
Homescreen 249 67 app_0 7323 106 108848 34232 ffffffff 40040330 S /system/b2g/plugin-container
(App) 490 400 app_0 7350 106 55704 16928 ffffffff 400c9330 S /system/b2g/plugin-container
and then launched the Maps app
Application OOM ADJ USER PID PPID VSIZE RSS WCHAN PC NAME
b2g 465 0 root 106 1 236036 92748 ffffffff 400c96ec R /system/b2g/b2g
Homescreen 239 67 app_0 7323 106 122312 32188 ffffffff 40040330 S /system/b2g/plugin-container
Maps 196 67 app_0 7350 106 62936 23952 ffffffff 410adee6 R /system/b2g/plugin-container
The Maps app popped up a dialog asking if it was ok to use geolocation, and presumably right around the same time, the prelaunch app started.
This killed the Maps app.
Application OOM ADJ USER PID PPID VSIZE RSS WCHAN PC NAME
b2g 465 0 root 106 1 237060 92840 ffffffff 400c96ec R /system/b2g/b2g
Homescreen 190 67 app_0 7323 106 122312 23084 ffffffff 40040330 S /system/b2g/plugin-container
(App) 490 400 app_0 7430 106 55704 16928 ffffffff 4002cd60 R /system/b2g/plugin-container
Assignee
Comment 10•13 years ago
Another problem with this is that apparently the homescreen never gets marked as background. (Its ADJ is always 67, even when the Maps app is showing.)
Comment 11•13 years ago
Actually, it does. I think here, Homescreen was in the foreground because the Maps app died.
I've also noticed that when launching an app, the previous foreground and the new foreground apps both have oom_score_adj of 67 momentarily and then the old foreground eventually switches to 400.
I'd expect us to give the b2g process an oom_score_adj of -1000 (never kill), the foreground app an oom_score_adj of 0, and background apps an oom_score_adj of 500 or 600.
QA Contact: dhylands
Assignee
Comment 12•13 years ago
> I've also noticed that when launching an app, the previous foreground and the new foreground apps
> both have oom_score_adj of 67 momentarily and then the old foreground eventually switches to 400.
There's a grace period, which we can tweak. See dom/ipc/ProcessPriorityManager.cpp. It's a horrible hack. :(
Updated•13 years ago
Assignee: nobody → dhylands
QA Contact: dhylands
Whiteboard: [LOE:S]
Comment 13•13 years ago
Killing apps that you *just* launched is bad, so let's get this fixed.
blocking-basecamp: ? → +
Assignee
Comment 14•13 years ago
Dave, do you still intend to work on this bug in the near future? This is important as we get more testers, since I expect to see more bugs like bug 808517, and I want to be able to rule out poor oom_adj values as the cause.
If you're busy with other things, we can probably find someone else to do this.
Comment 15•13 years ago
Justin, I'll unassign myself. I got a bunch of new stuff at the work week.
Assignee: dhylands → nobody
Comment 16•13 years ago
Considering that this bug is related to the low-memory-killer thresholds (adj and minfree) under low memory, maybe we should take the values below into consideration. As far as I know, most partners tweak these values (according to the platform and /proc/meminfo) for optimization (and to reduce how often the kernel invokes the OOM killer?).
/sys/module/lowmemorykiller/parameters/adj
/sys/module/lowmemorykiller/parameters/minfree
My reference phone with the B2G ROM has 0,1,6 for adj and 256,1024,2048 for minfree. With the shipping Android ROM it has 0,1,2,4,9,15 for adj and 3676,4971,6266,8314,9610,11448 for minfree.
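(For reference, a sketch of inspecting and tweaking those thresholds from an adb root shell; adj and minfree are parallel comma-separated lists, with minfree in pages, and the tweak values below are made up purely for illustration.)
adb shell cat /sys/module/lowmemorykiller/parameters/adj
adb shell cat /sys/module/lowmemorykiller/parameters/minfree
# Illustrative values only -- real ones need tuning per device against /proc/meminfo.
adb shell "echo 0,1,6 > /sys/module/lowmemorykiller/parameters/adj"
adb shell "echo 1024,2048,4096 > /sys/module/lowmemorykiller/parameters/minfree"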
Assignee
Comment 17•13 years ago
> Considering this bug would be related to low-memory-killer threshold (adj and minfree) when low
> memory, maybe we should take below values in to consideration.
Yes, we also need to tweak those. We're already setting them in GonkHal.
Comment 18•13 years ago
Just a note to say that b2g-ps with the --oom option was just merged into gonk-misc.
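(So the OOM columns in the listings above can presumably be reproduced with something like the following after updating gonk-misc; the exact output format may differ.)
b2g-ps --oom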
Assignee
Comment 19•13 years ago
I ran
$ adb shell "$(cat ps_hack)"
where ps_hack is
# Print the oom_score_adj and command line of every process on the device.
for x in /proc/*; do
  if [ -e "$x/oom_adj" ]; then
    echo "$x $(cat $x/oom_score_adj) $(cat $x/cmdline)"
  fi
done
This showed that critical phone services run with oom_score_adj -941. These processes are:
> /init
> /system/bin/vold
> /system/bin/fakeperm
> /system/bin/rilproxy
> /system/bin/netd
> /system/bin/rild
> /system/bin/drmserver
> /system/bin/mediaserver
> /system/bin/dbus-daemon --system --nofork
> /system/bin/installd
> /system/bin/keystore /data/misc/keystore
> /system/bin/akmd8962
> /sbin/adbd
> /system/bin/sh
> /system/bin/qmuxd
> /system/bin/netmgrd
> /system/bin/ATFWD-daemon
> /system/bin/wpa_supplicant
> /system/bin/debuggerd
> /sbin/ueventd
> /system/bin/servicemanager
There are a bunch of other system processes running with oom_score_adj 0:
> root 2 0 0 0 c00cb25c 00000000 S kthreadd
> root 3 2 0 0 c00b73f8 00000000 S ksoftirqd/0
> root 6 2 0 0 c00c65a8 00000000 S khelper
> root 7 2 0 0 c00c65a8 00000000 S suspend_sys_syn
> root 8 2 0 0 c00c65a8 00000000 S suspend
> root 9 2 0 0 c0113074 00000000 S sync_supers
> root 10 2 0 0 c0113d6c 00000000 S bdi-default
> root 11 2 0 0 c00c65a8 00000000 S kblockd
> root 12 2 0 0 c02eff78 00000000 S khubd
> root 13 2 0 0 c00c65a8 00000000 S l2cap
> root 14 2 0 0 c00c65a8 00000000 S a2mp
> root 15 2 0 0 c00c65a8 00000000 S modem_notifier
> root 16 2 0 0 c00c65a8 00000000 S smd_channel_clo
> root 19 2 0 0 c00c65a8 00000000 S rpcrouter
> root 20 2 0 0 c00c65a8 00000000 S rpcrotuer_smd_x
> root 21 2 0 0 c00627b0 00000000 S krpcserversd
> root 23 2 0 0 c0061714 00000000 S kadspd
> root 24 2 0 0 c00614a8 00000000 D voicememo_rpc
> root 25 2 0 0 c010df90 00000000 S kswapd0
> root 26 2 0 0 c015b184 00000000 S fsnotify_mark
> root 27 2 0 0 c00c65a8 00000000 S crypto
> root 40 2 0 0 c00c65a8 00000000 S mdp_dma_wq
> root 41 2 0 0 c00c65a8 00000000 S mdp_vsync_wq
> root 42 2 0 0 c00c65a8 00000000 S mdp_hist_wq
> root 43 2 0 0 c00c65a8 00000000 S mdp_pipe_ctrl_w
> root 45 2 0 0 c00c65a8 00000000 S kgsl-3d0
> root 46 2 0 0 c02c4d38 00000000 S mtdblock0
> root 47 2 0 0 c02c4d38 00000000 S mtdblock1
> root 48 2 0 0 c02c4d38 00000000 S mtdblock2
> root 49 2 0 0 c02c4d38 00000000 S mtdblock3
> root 50 2 0 0 c02c4d38 00000000 S mtdblock4
> root 51 2 0 0 c02c4d38 00000000 S mtdblock5
> root 52 2 0 0 c02c4d38 00000000 S mtdblock6
> root 53 2 0 0 c02c4d38 00000000 S mtdblock7
> root 54 2 0 0 c02c4d38 00000000 S mtdblock8
> root 55 2 0 0 c02c4d38 00000000 S mtdblock9
> root 56 2 0 0 c02c4d38 00000000 S mtdblock10
> root 57 2 0 0 c02c4d38 00000000 S mtdblock11
> root 58 2 0 0 c02c4d38 00000000 S mtdblock12
> root 59 2 0 0 c02c4d38 00000000 S mtdblock13
> root 60 2 0 0 c02c4d38 00000000 S mtdblock14
> root 61 2 0 0 c02c4d38 00000000 S mtdblock15
> root 62 2 0 0 c02c4d38 00000000 S mtdblock16
> root 63 2 0 0 c02c4d38 00000000 S mtdblock17
> root 70 2 0 0 c00c65a8 00000000 S k_rmnet_mux_wor
> root 71 2 0 0 c00c65a8 00000000 S f_mtp
> root 72 2 0 0 c0331000 00000000 S file-storage
> root 74 2 0 0 c00c65a8 00000000 S diag_wq
> root 75 2 0 0 c00c65a8 00000000 S diag_cntl_wq
> root 76 2 0 0 c00c65a8 00000000 S atmel_wq
> root 78 2 0 0 c00614a8 00000000 D krtcclntd
> root 79 2 0 0 c0063dd8 00000000 D krtcclntcbd
> root 80 2 0 0 c00c65a8 00000000 S kfmradio
> root 81 2 0 0 c00614a8 00000000 D kbatteryclntd
> root 82 2 0 0 c0063dd8 00000000 D kbatteryclntcbd
> root 83 2 0 0 c00c65a8 00000000 S iewq
> root 84 2 0 0 c00cb36c 00000000 D kinteractiveup
> root 85 2 0 0 c00c65a8 00000000 S mmcsdcc_host1
> root 86 2 0 0 c00c65a8 00000000 S mmcsdcc_host2
> root 87 2 0 0 c00c65a8 00000000 S binder
> root 88 2 0 0 c051d05c 00000000 S krfcommd
> root 89 2 0 0 c00614a8 00000000 D khsclntd
> root 92 2 0 0 c03b7678 00000000 S mmcqd/0
> root 93 2 0 0 c01e8cdc 00000000 S yaffs-bg-1
> root 95 2 0 0 c01e8cdc 00000000 S yaffs-bg-1
> root 96 2 0 0 c01e8cdc 00000000 S yaffs-bg-1
> root 97 2 0 0 c01e8cdc 00000000 S yaffs-bg-1
> root 164 2 0 0 c00c65a8 00000000 S k_gserial
> root 165 2 0 0 c00c65a8 00000000 S k_gsmd
> root 357 2 0 0 c00c65a8 00000000 S cfg80211
> root 365 2 0 0 c00c65a8 00000000 S ath6kl
> root 495 2 0 0 c00614a8 00000000 D audmgr_rpc
> root 3631 2 0 0 c014a874 00000000 S flush-31:6
> root 19354 2 0 0 c00c7c08 00000000 S kworker/u:2
> root 19452 2 0 0 c005f160 00000000 D kworker/u:1
> root 21027 2 0 0 c00c7c08 00000000 S kworker/0:1
> root 31382 2 0 0 c00c7c08 00000000 S kworker/0:3
> root 31387 2 0 0 c00c7c08 00000000 S kworker/u:0
> root 31393 2 0 0 c03b4a78 00000000 S ksdioirqd/mmc1
My suspicion is that we don't want to profligately kill these system processes. If that's true, the lowest oom_score_adj we should use for any Gecko process is 0.
So I'm going to try setting the [master, fg, bg] oom_score_adj values to [0, 500, 1000] and see if that fixes bug 808517.
Suggestions are welcome, since I don't claim that I know what I'm doing here.
Comment 20•13 years ago
If you use an oom_score_adj of 1000 then there will be no preference amongst the background processes, since the max oom_score is 1000.
You may want to look at something like 700 or 800 for the background process so that you get some variation based on the amount of memory that they're using.
I've never seen the b2g process get killed in favor of the others, so picking a fairly low master adjustment seems like a good thing as well (based on measurements I've taken, I wouldn't go much above 100 for foreground).
With a recent build, I noticed that Messages and Cost Control both seem to retain their "foreground" priority, which makes large-memory apps like Maps more likely to be killed even when they're in the foreground.
Assignee
Comment 21•13 years ago
> I've never seen the b2g process get killed in favor of the others
Bug 808517 sounds like that's what's happening.
Assignee
Comment 22•13 years ago
> Bug 808517 sounds like that's what's happening.
To be clear, bug 808517 and its ilk are the only reason I'm looking at this bug at all.
> With a recent build, I noticed that Messages and Cost Control both seem to retain their "foreground"
> priority which makes large memory apps like maps more likely to be killed, even when in the
> foreground
cjones tells me we will not be running these apps in the background as we currently do, so I've been ignoring them.
> You may want to look at something like 700 or 800 for the background process so that you get some
> variation based on the amount of memory that they're using.
I see; thanks.
Comment 23•13 years ago
(In reply to Justin Lebar [:jlebar] from comment #22)
> > Bug 808517 sounds like that's what's happening.
>
> To be clear, bug 808517 and its ilk is the only reason I'm looking at this
> bug at all.
>
> > With a recent build, I noticed that Messages and Cost Control both seem to retain their "foreground"
> > priority which makes large memory apps like maps more likely to be killed, even when in the
> > foreground
>
> cjones tells me we will not be running these apps in the background as we
> currently do, so I've been ignoring them.
>
Messages is tracked in bug 801351.
And Cost Control is being worked on. (Salva, is there already a bug number to remove it?)
> > You may want to look at something like 700 or 800 for the background process so that you get some
> > variation based on the amount of memory that they're using.
>
> I see; thanks.
Assignee
Comment 24•13 years ago
I can't test this right now because my phone's wifi isn't working. I'll try to get some help, but if nobody has any ideas, I'll put this off and see if tip is more functional in a few days.
Assignee
Comment 26•13 years ago
(In reply to Dave Hylands [:dhylands] from comment #25)
> Giving to jlebar to reassign
I thought we agreed on IRC that the first person done with their current work would grab this? Or did I misunderstand?
Comment 27•13 years ago
I thought we were talking about the bug where the main process was OOMing. Probably related.
Assignee
Updated•13 years ago
Summary: Tweak oom_adj and nice values for master/foreground/background processes → Tweak oom_adj and nice values for master/foreground/background processes (particularly to ensure that the parent process doesn't die on OOM, if that's actually happening)
Comment 28•13 years ago
Hey all, what needs to happen here? Is this still a release-blocking piece of work?
Assignee
Comment 29•13 years ago
I or someone else needs to look through the bugs which this blocks and figure out if and why we're killing the main process. Those bugs are blockers, so inasmuch as this is the suspected fix, this is a blocker. I'd rather not un-nom this bug, because that just means that landing an eventual fix here will take many days longer.
Updated•13 years ago
Priority: -- → P1
Comment 30•13 years ago
This sounds like a potentially long-running bug because of the amount of investigation needed. Justin, if you can't take this on almost immediately, we should find someone else and get them on it.
Also, what is the target milestone for this issue?
Assignee
Comment 31•13 years ago
> Justin, if you can't take this on almost immediately, we should find someone else and get
> them on it.
I'd love some assistance; I'm still busy with works in progress.
Whiteboard: [LOE:S]
Comment 32•13 years ago
After reviewing the lowmem killer code, the algorithm basically boils down to the following:
1 - Find the process with the largest oom_adj value.
2 - Amongst the processes which share the largest oom_adj value, pick the one with the largest RSS.
So, from this we can conclude that the main process will never be selected over a content process (this agrees with all of my observations).
The actual value of the oom_adj doesn't matter that much; it really just acts as a sort of "group control".
So all that really matters is the relative order of the oom_adj values, not their absolute values.
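As a rough illustration of that selection order, here is a sketch that picks the same victim from userspace (assumptions: RSS is read as resident pages from /proc/<pid>/statm, and the minfree thresholds that decide *whether* to kill at all are ignored):
best_adj=-100; best_rss=-1; victim=
for d in /proc/[0-9]*; do
  [ -r "$d/oom_adj" ] || continue
  adj=$(cat "$d/oom_adj")
  rss=$(awk '{print $2}' "$d/statm" 2>/dev/null)   # resident set size, in pages
  [ -n "$rss" ] || continue
  # 1) prefer the largest oom_adj; 2) break ties on the largest RSS
  if [ "$adj" -gt "$best_adj" ] || { [ "$adj" -eq "$best_adj" ] && [ "$rss" -gt "$best_rss" ]; }; then
    best_adj=$adj; best_rss=$rss; victim=${d#/proc/}
  fi
done
echo "would select pid $victim (oom_adj=$best_adj, rss=$best_rss pages)"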
From this, I think we can conclude that no change is actually required.
We may wish to add one or more extra categories if we decide we want to impose some ordering amongst, say, foreground processes (i.e. make the dialer have a lower oom_adj than other foreground processes).
The only other change I can think of is that for an app transitioning from foreground to background, we currently defer the change of oom_adj for a time period. We should probably change it immediately to some intermediate value between foreground and background, and change it to the background value after a timeout.
That way, if you bring a large app to the foreground and something needs to be killed, the kernel will kill the app that was just swapped out. With the current scheme, if the need to kill an app occurs during the window where the outgoing and incoming apps both have foreground priority, the kernel will be more likely to pick the incoming app if it has a larger RSS than the outgoing app.
Assignee
Comment 33•13 years ago
> From this, I think we can conclude that no change is actually required.
I'd been using this bug to track the dependencies, which are cases when the main process dies due to what appears to be OOM caused by the child process.
I've looked at the kernel code and concur with your analysis. But /something/ is going on here. At the very least, perhaps all these dependent bugs aren't OOMs. Or perhaps the kernel isn't quick enough about killing the child process and kills the main process too, in an OOM situation.
I'll update the bug summary to more accurately reflect what I've been using it to track.
Assignee
Updated•13 years ago
Summary: Tweak oom_adj and nice values for master/foreground/background processes (particularly to ensure that the parent process doesn't die on OOM, if that's actually happening) → Apparent child-process OOMs can kill the parent process, despite oom_adj scores which should preclude this
Whiteboard: [see comment 33]
Comment 34•13 years ago
So, if the low-memory killer is killing the main process, it should be leaving some evidence in the dmesg log (the select and send sigkill messages).
If those aren't present, then it's dying for some other reason.
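For example, something like this should surface those messages (the exact wording varies by kernel, so the grep pattern is deliberately loose):
adb shell dmesg | grep -iE 'lowmem|select|sigkill'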
Assignee
Comment 35•13 years ago
All of the dependencies here are resolved as either not OOMs or not reproducible.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME