Closed Bug 989572 Opened 7 years ago Closed 7 years ago

[Tarako] LMK doesn't work properly, causing kernel OOM that kills many processes and leaves the phone unresponsive

Categories

(Firefox OS Graveyard :: Performance, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:1.3T+, b2g-v1.3T fixed)

RESOLVED FIXED
1.4 S6 (25apr)
blocking-b2g 1.3T+
Tracking Status
b2g-v1.3T --- fixed

People

(Reporter: james.zhang, Assigned: ying.xu, NeedInfo)

References

Details

(Keywords: perf, Whiteboard: [c=memory p= s= u=tarako] [POVB][priority][wait for sprd kernel patch])

Attachments

(8 files)

Attached file 294706.rar
See the kernel log in the attachment; search for the keyword 'Out of memory: Kill process'.
294706/slog_20140327145600_sp6821a_userdebug/external_storage/last_log/20140327142655/kernel/kernel.log

03-27 14:47:16.130 <3>0[ 1181.686917] Out of memory: Kill process 81 (b2g) score 223 or sacrifice child
03-27 14:47:16.130 <3>0[ 1181.686985] Killed process 81 (b2g) total-vm:139004kB, anon-rss:0kB, file-rss:5632kB

03-27 14:46:23.700 <3>0[ 1169.324988] Out of memory: Kill process 1111 (Music) score 260 or sacrifice child
03-27 14:46:23.700 <3>0[ 1169.325055] Killed process 1111 (Music) total-vm:83808kB, anon-rss:528kB, file-rss:4544kB
blocking-b2g: --- → 1.3T?
Flags: needinfo?(ttsai)
Flags: needinfo?(styang)
Flags: needinfo?(kkuo)
Flags: needinfo?(gal)
Flags: needinfo?(fabrice)
Hi Thinker,

Could someone on your team take a look at this? Thanks.

--
Keven
Flags: needinfo?(ttsai)
Flags: needinfo?(tlee)
Flags: needinfo?(kkuo)
By comment 0, it seems some process is using too much memory and zRAM is working so hard that memory is exhausted.  How can I read the attached log?  It is in a format unknown to me.  Do you have STR?
Flags: needinfo?(tlee)
Could you provide the output of procrank?
(In reply to Thinker Li [:sinker] from comment #2)
> By comment 0, it seems like some process use too much memory and zRAM work
> too hard to eat out of memory.  How could I read the attached log?  It is in
> an unknown format for me.  Do you have STR?

Use a text editor or vim to read this log.
see  294706/slog_20140327145600_sp6821a_userdebug/external_storage/last_log/20140327142655/kernel/kernel.log


There are many STR; please register on our Bugzilla and check:
http://bugzilla.spreadtrum.com/bugzilla/show_bug.cgi?id=294616
Flags: needinfo?(fabrice)
I have seen similar effects before, where one process is allocating a large chunk of memory and isn't returning to the event loop quickly enough to handle the incoming memory-pressure events.
Flags: needinfo?(gal)
Attached file kernel.log
Flags: needinfo?(styang)
(In reply to James Zhang from comment #4)
> (In reply to Thinker Li [:sinker] from comment #2)
> > By comment 0, it seems like some process use too much memory and zRAM work
> > too hard to eat out of memory.  How could I read the attached log?  It is in
> > an unknown format for me.  Do you have STR?
> 
> Use edit or vim to ready this log.
> see 
> 294706/slog_20140327145600_sp6821a_userdebug/external_storage/last_log/
> 20140327142655/kernel/kernel.log
> 
> 
> A lot of STR, please register and access our bugzilla to check.
> http://bugzilla.spreadtrum.com/bugzilla/show_bug.cgi?id=294616

I'm not able to access the bug; please add me. Kernel log attached above.
ni? Thinker on the latest log. thanks
Flags: needinfo?(tlee)
Flags: needinfo?(tlee)
Ting will look into this.
LMK did work:

03-27 14:39:12.426 <4>0[  741.897543] select 430 (Music), adj 10, size 6577, to kill
03-27 14:39:12.426 <4>0[  741.897562] send sigkill to 430 (Music), adj 10, size 6577
03-27 14:42:55.247 <4>0[  964.722728] select 601 (Gallery), adj 10, size 5837, to kill
03-27 14:42:55.247 <4>0[  964.722794] send sigkill to 601 (Gallery), adj 10, size 5837

and the very first message from OOM killer is:

03-27 14:45:14.690 <4>0[ 1103.027669] engpcclient invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-16, oom_score_adj=-941

James, how should I reproduce this?
Flags: needinfo?(james.zhang)
James, sorry, I overlooked the STR at comment 4.
Unreclaimable slab occupies a lot of memory, growing to around 87MB:

  03-27 14:45:14.690 <4>0[ 1103.029564] ...slab_unreclaimable:41016kB
  03-27 14:47:46.980 <4>0[ 1254.463877] ...slab_unreclaimable:89548kB

I inspected the number when a) b2g has just started, and b) there are several applications running; it wasn't really affected by zram:

a)
            b2g 1109    1   14.9    0 34.7 38.8 46.6 145.8       0 root    
         (Nuwa) 1167 1109    0.8    0  3.7  7.2 15.1  51.3       0 root    
     Homescreen 1184 1167    2.8    1 10.7 15.8 26.1  68.4       2 app_1184
(Preallocated a 1265 1167    0.6   18  4.7  8.6 17.5  60.2       1 root  

  root@android:/ # cat /sys/block/zram0/mem_used_total                           
  2420736

  root@android:/ # cat /proc/meminfo
  SUnreclaim:         6560 kB

b)
            b2g  83    1   48.4    0 22.1 23.6 30.3 153.6       0 root   
         (Nuwa) 342   83    1.2    0  0.0  0.9  5.2  51.3       0 root   
     Homescreen 372  342    5.6   18  0.6  1.9  9.1  65.5       8 app_372
          Music 686  342   10.0    7  2.6  4.0 11.2  68.9       6 app_686
        Gallery 724  342   10.1   18  0.5  1.8  8.9  66.5      11 app_724
       Settings 833  342    5.8   18  0.6  1.9  9.1  65.4      10 app_833
 Communications 944  342   12.6    1 13.8 15.2 22.4  76.1       2 app_944
(Preallocated a 996  342    0.4   18  0.9  1.3  4.0  54.2       1 root   

  root@android:/ # cat /sys/block/zram0/mem_used_total                           
  22597632

  root@android:/ # cat /proc/meminfo
  SUnreclaim:         6616 kB
Is the kernel leaking memory?
Ying, please check it with our kernel team.
Flags: needinfo?(james.zhang) → needinfo?(ying.xu)
Ting, zRAM grew from 2.4MB to 22.5MB.  Did you misread or mispaste it?
I copied that directly from the console, and I think it makes sense, as there were several applications running.
You should look at the last_msg log.

From the kernel log, you can find slab_unreclaimable:43348kB, and it keeps growing.
I suspect there is a memory leak in the kernel.
I will enable the kernel memory leak detector and run a monkey test,
hoping to reproduce this bug.
triage: 1.3T+ tracking kernel memory leak bug. [POVB]
Ying.xu, assigning to you for now as you seem to be working on it. Thanks.
Assignee: nobody → ying.xu
blocking-b2g: 1.3T? → 1.3T+
Whiteboard: [POVB]
Whiteboard: [POVB] → [POVB][priority]
Severity: critical → blocker
Status: NEW → ASSIGNED
Keywords: perf
Priority: -- → P1
Whiteboard: [POVB][priority] → [c=memory p= s= u=tarako] [POVB][priority]
With Ying's patch we can't reproduce this issue now.
Have the Mozilla folks seen this issue again?


commit b23e47c77ecbca81b87edf58a7b358871f21e1d0
Author: ying.xu <ying.xu@spreadtrum.com>
Date:   Sat Mar 29 17:20:42 2014 +0800

    Bug#294616 increase proc/sys/vm/min_free_kbytes to 4096 KBytes
    
    [self test   ] pass
    
    Change-Id: Ib7b2bdb52481982c86d63c46bc408fd37f4c567e

diff --git a/sp6821a_gonk/init.rc b/sp6821a_gonk/init.rc
index 65c315a..5125f64 100644
--- a/sp6821a_gonk/init.rc
+++ b/sp6821a_gonk/init.rc
@@ -244,6 +244,7 @@ on boot
 # level system server to be able to adjust the kernel OOM driver
 # parameters to match how it is managing things.
     write /proc/sys/vm/overcommit_memory 1
+    write /proc/sys/vm/min_free_kbytes 4096
     write /proc/sys/vm/min_free_order_shift 4
     chown root system /sys/module/lowmemorykiller/parameters/adj
     chmod 0664 /sys/module/lowmemorykiller/parameters/adj
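For context on what this patch changes: min_free_kbytes feeds the per-zone watermarks that reclaim works against, so raising it from the default makes kswapd start (and keep) reclaiming earlier. Below is a single-zone sketch of the usual min → low → high relationship from mm/page_alloc.c in kernels of this era; `wmarks_from_min_free` is an illustrative helper, not kernel code.

```c
/* Illustrative single-zone model of how the kernel derives its
 * reclaim watermarks from /proc/sys/vm/min_free_kbytes
 * (cf. __setup_per_zone_wmarks in mm/page_alloc.c):
 *   low  = min + min/4   -> kswapd wakes up below this
 *   high = min + min/2   -> kswapd goes back to sleep above this
 */
struct watermarks { long min_kb, low_kb, high_kb; };

static struct watermarks wmarks_from_min_free(long min_free_kbytes) {
    struct watermarks w;
    w.min_kb  = min_free_kbytes;
    w.low_kb  = min_free_kbytes + min_free_kbytes / 4;
    w.high_kb = min_free_kbytes + min_free_kbytes / 2;
    return w;
}
```

With the default 1352 KB the low watermark sits at only 1690 KB, while the patched 4096 KB moves it to 5120 KB, giving background reclaim much more headroom before allocations start failing.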
Flags: needinfo?(tlee)
Flags: needinfo?(tchou)
I tried only 4 or 5 times and have never reproduced it. BTW, I am not sure whether Ying's patch is relevant to the continually growing slab.
Flags: needinfo?(tchou)
(In reply to Ting-Yu Chou [:ting] from comment #19)
> I tried only 4 or 5 times, and have never reproduced it. BTW, I am not sure
> is Ying's patch relevant for slab continually growing.

But we are seeing a zram leak now, and it is causing a performance issue.
Flags: needinfo?(tlee)
(In reply to James Zhang from comment #20)
> But we meet zram leak now. It cause performance issue now.

Is there a bug about this "zram leak"?

From comment 12, even though zram's memory consumption grew from 2.4MB to 22.5MB, unreclaimable slab stayed around 6.4MB.
This bug still exists with Ying's min_free patch; see
http://bugzilla.spreadtrum.com/bugzilla/show_bug.cgi?id=297942
It has a 100% reproduction rate now.
(In reply to Ting-Yu Chou [:ting] from comment #10)
> LMK did work:
> 
> 03-27 14:39:12.426 <4>0[  741.897543] select 430 (Music), adj 10, size 6577,
> to kill
> 03-27 14:39:12.426 <4>0[  741.897562] send sigkill to 430 (Music), adj 10,
> size 6577
> 03-27 14:42:55.247 <4>0[  964.722728] select 601 (Gallery), adj 10, size
> 5837, to kill
> 03-27 14:42:55.247 <4>0[  964.722794] send sigkill to 601 (Gallery), adj 10,
> size 5837
> 
> and the very first message from OOM killer is:
> 
> 03-27 14:45:14.690 4>0[ 1103.027669] engpcclient invoked oom-killer:
> gfp_mask=0xd0, order=0, oom_adj=-16, oom_score_adj=-941
> 
> James, how should I reproduce this?

We don't have STR, but it did happen many times, through various operations.
Flags: needinfo?(ying.xu)
Attached file Connect_A2_issue.rar
(In reply to Ting-Yu Chou [:ting] from comment #19)
> I tried only 4 or 5 times, and have never reproduced it. BTW, I am not sure
> is Ying's patch relevant for slab continually growing.

No, it should have no relation to the value of slab_unreclaimable.
The value of slab_unreclaimable was too high,
and since most processes were killed by the OOM killer, slab_unreclaimable should have decreased.
It did not, so I suspect there may be a memory leak in the kernel.
After landing these two patches in the kernel, I think LMK should work properly.

commit 62841ea1775f84d5b0569ccb45c166944e3f5c90
Author: ying.xu <ying.xu@spreadtrum.com>
Date:   Sat Apr 5 16:17:15 2014 +0800

    Bug#245770 add swap check condition of low memory notify function
    
    [self test   ] test ok
    
    Change-Id: I1d90c3bc508e6fa31773b7ad2ebb02e297237ec2

commit dfca13faa3d0c12cea2115c0007e4f443b91954c
Author: ying.xu <ying.xu@spreadtrum.com>
Date:   Sat Apr 5 13:09:14 2014 +0800

    Bug#245770 add swap free memory check function
    
    [self test   ] kernel run ok, kill process ok
    
    Change-Id: I06b8ca31e323ccdf4e5face4a882fc575267ee3e
When running the monkey test, I use "adb shell vmstat" to monitor memory status.
It has run for about 12 hours and I haven't found a memory leak.
Flags: needinfo?(ttsai)
Current LMK settings: the Music app in the background can survive longer, until free memory drops to 8172 KB.

 notify_trigger 14336 KB

  oom_adj min_free
        0  1024 KB
        1  2048 KB
        2  4096 KB
        6  8172 KB
        8 16384 KB
       10 18432 KB

commit 3945e13d8165bc6e0a5d865c6f12ce63954d910c
Author: ying.xu <ying.xu@spreadtrum.com>
Date:   Sat Apr 5 13:31:09 2014 +0800

    Bug#245770 change the LMK parameters
    
    [self test   ] kill process OK
    
    Change-Id: I04043c0911463652bd927be028eafe28a5cd243c
Flags: needinfo?(ttsai)
Noting something abnormal in the kernel log:
03-27 14:46:23.700 <3>0[ 1169.324988] Out of memory: Kill process 1111 (Music) score 260 or sacrifice child
03-27 14:46:23.700 <3>0[ 1169.325055] Killed process 1111 (Music) total-vm:83808kB, anon-rss:528kB, file-rss:4544kB
03-27 14:46:23.700 <4>0[ 1170.341033] slog: page allocation failure: order:0, mode:0x20058
...
03-27 14:46:23.702 <4>0[ 1170.343473] Normal: 4*4kB 26*8kB 33*16kB 15*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1360kB

slog needed to allocate 4kB (order 0) and there was apparently enough memory: 4 free 4kB pages and 1360kB free in total. This looks abnormal, and it then caused the OOM killer to kill processes.
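One way to read "allocation failure with 1360kB free" is the zone watermark check: an order-0 request is refused when granting it would leave the zone below its min watermark plus the lowmem reserve, no matter what the headline free number says. A simplified sketch follows; this is not the real `zone_watermark_ok()`, and the page counts and reserves in the comments are illustrative.

```c
#include <stdbool.h>

/* Simplified zone watermark check: an allocation of 2^order pages
 * only succeeds if the zone stays above min watermark + reserve
 * afterwards. All quantities are in 4kB pages. */
static bool watermark_ok(long free_pages, int order,
                         long min_watermark_pages, long lowmem_reserve_pages) {
    long free_after = free_pages - (1L << order);
    return free_after >= min_watermark_pages + lowmem_reserve_pages;
}
```

1360kB free is only 340 pages; if the min watermark plus reserved pages exceed that, even an order-0 allocation is refused, which is "correct" from the allocator's point of view even though memory looks free.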

In this kernel log, we found several memory allocation failures even though memory was still available:
1.
03-27 14:45:31.470 <4>0[ 1120.033113] sync_supers: page allocation failure: order:0, mode:0x20058
...
03-27 14:45:31.470 <4>0[ 1120.034977] Normal: 0*4kB 36*8kB 31*16kB 14*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1360kB

2.
03-27 14:45:31.470 <4>0[ 1120.773613] sync_supers: page allocation failure: order:0, mode:0x20058
...
03-27 14:45:39.043 <4>0[ 1120.775490] Normal: 1*4kB 33*8kB 31*16kB 14*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1340kB

3.
03-27 14:46:23.700 <4>0[ 1170.341033] slog: page allocation failure: order:0, mode:0x20058
...
03-27 14:46:23.702 <4>0[ 1170.343473] Normal: 4*4kB 26*8kB 33*16kB 15*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1360kB
 
Hi James, please have a team member check whether there is any leak or memory allocation issue here.
Flags: needinfo?(james.zhang)
This is interesting:

cervantes@cyu-dt:/tmp/294706/slog_20140327145600_sp6821a_userdebug/external_storage/last_log/20140327142655/kernel$ grep "slab pages" kernel.log 
03-27 14:45:14.690 <4>0[ 1103.031831] 10543 slab pages
03-27 14:45:31.470 <4>0[ 1120.036695] 10541 slab pages
03-27 14:45:39.043 <4>0[ 1120.777375] 10587 slab pages
03-27 14:46:21.656 <4>0[ 1169.322580] 12717 slab pages
03-27 14:46:23.702 <4>0[ 1170.345162] 12731 slab pages
03-27 14:47:11.158 <4>0[ 1181.377434] 14990 slab pages
03-27 14:47:36.880 <4>0[ 1245.634627] 22606 slab pages
03-27 14:47:39.590 <4>0[ 1246.122733] 22639 slab pages
03-27 14:47:46.980 <4>0[ 1254.466034] 22633 slab pages
03-27 14:47:47.022 <4>0[ 1254.868664] 22693 slab pages
03-27 14:47:56.055 <4>0[ 1264.484709] 22733 slab pages
03-27 14:48:07.690 <4>0[ 1271.103246] 22691 slab pages
03-27 14:48:11.293 <4>0[ 1271.310682] 22640 slab pages
03-27 14:48:12.792 <4>0[ 1271.933731] 22712 slab pages
03-27 14:48:13.248 <4>0[ 1280.293407] 22720 slab pages
03-27 14:48:13.251 <4>0[ 1280.383515] 22777 slab pages
03-27 14:48:13.255 <4>0[ 1280.533541] 22791 slab pages
03-27 14:48:15.810 <4>0[ 1284.782953] 22808 slab pages
03-27 14:48:17.804 <4>0[ 1284.873316] 22815 slab pages
03-27 14:48:28.594 <4>0[ 1297.772646] 22786 slab pages
03-27 14:48:35.397 <4>0[ 1300.436390] 22869 slab pages
03-27 14:48:35.899 <4>0[ 1300.445163] 22869 slab pages
03-27 14:48:48.471 <4>0[ 1317.715951] 22988 slab pages

Obviously LMK isn't enough, and even with the OOM killer the slab pages continue growing. By the end of this log the OOM killer keeps killing other processes even though all b2g processes are dead:

03-27 14:48:48.471 <4>0[ 1317.714482] Total swap = 65532kB
03-27 14:48:48.471 <4>0[ 1317.715860] 28854 pages of RAM
03-27 14:48:48.471 <4>0[ 1317.715894] 807 free pages
03-27 14:48:48.471 <4>0[ 1317.715921] 1589 reserved pages
03-27 14:48:48.471 <4>0[ 1317.715951] 22988 slab pages
03-27 14:48:48.471 <4>0[ 1317.715978] 113 pages shared
03-27 14:48:48.471 <4>0[ 1317.716005] 0 pages swap cached
03-27 14:48:48.471 <6>0[ 1317.716035] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
03-27 14:48:48.471 <6>0[ 1317.716110] [  167]  1000   167     2089        0   0     -16          -941 engpcclient
03-27 14:48:48.471 <6>0[ 1317.716178] [  169]  1000   169      464        0   0     -16          -941 engservice
03-27 14:48:48.471 <6>0[ 1317.716244] [  312]     0   312     2856        0   0     -16          -941 slog
03-27 14:48:48.471 <6>0[ 1317.716307] [  399]  2000   399      818        0   0     -16          -941 tcp
03-27 14:48:48.471 <6>0[ 1317.716369] [  401]     0   401     2856        0   0     -16          -941 slog
03-27 14:48:48.471 <6>0[ 1317.716433] [ 1379]     0  1379      192       12   0     -16          -941 oom-msg-logger
03-27 14:48:48.471 <6>0[ 1317.716502] [ 1508]     0  1508     1453        0   0     -16          -941 b2g
03-27 14:48:48.471 <6>0[ 1317.716564] [ 1510]  1001  1510     1647        1   0     -16          -941 rild_sp
03-27 14:48:48.471 <6>0[ 1317.716629] [ 1513]  1001  1513     1647        1   0     -16          -941 rild_sp
03-27 14:48:48.471 <6>0[ 1317.716693] [ 1526]     0  1526       75       23   0     -16          -941 ueventd
03-27 14:48:48.471 <6>0[ 1317.716758] [ 1527]  1013  1527      382        0   0     -16          -941 mediaserver
03-27 14:48:48.471 <6>0[ 1317.716825] [ 1528]  1000  1528      207        1   0     -16          -941 servicemanager
03-27 14:48:48.471 <6>0[ 1317.716893] [ 1531]     0  1531      469        0   0     -16          -941 vold
03-27 14:48:48.471 <6>0[ 1317.716956] [ 1534]  1001  1534      205        1   0     -16          -941 rilproxy
03-27 14:48:48.471 <6>0[ 1317.717021] [ 1535]     0  1535      511        0   0     -16          -941 netd
03-27 14:48:48.473 173        0   0     -16          -941 debuggerd
03-27 14:48:48.473 <6>0[ 1317.717148] [ 1537]  1002  1537      329        0   0     -16          -941 dbus-daemon
03-27 14:48:48.473 <6>0[ 1317.717215] [ 1538]  1017  1538      432        0   0     -16          -941 keystore
03-27 14:48:48.473 <6>0[ 1317.717280] [ 1539]  1001  1539      205        1   0     -16          -941 rilproxy
03-27 14:48:48.473 <6>0[ 1317.717345] [ 1540]     0  1540      228        0   0     -16          -941 engmoded
03-27 14:48:48.473 <6>0[ 1317.717410] [ 1543]     0  1543      858       28   0     -16          -941 adbd
03-27 14:48:48.473 <6>0[ 1317.717472] [ 1547]     0  1547      176        0   0     -16          -941 sh
03-27 14:48:48.473 <3>0[ 1317.717531] Out of memory: Kill process 167 (engpcclient) score 1 or sacrifice child

This looks like a kernel leak.
Summary: [Tarako] gecko LMK doesn't work properly, causing kernel OOM that kills many processes and leaves the phone unresponsive → [Tarako] LMK doesn't work properly, causing kernel OOM that kills many processes and leaves the phone unresponsive
James, please make sure this is still reproducible, and check if it is a kernel leak.
Flags: needinfo?(james.zhang)
Do you think it is good to close this bug?  You could open another bug if, once you've confirmed the leak is fixed, something is still wrong.
Flags: needinfo?(james.zhang)
In the kernel log from attachment 8398866 [details], we can see:
- slab pages increase significantly, as described in comment 30

In the kernel log from attachment 8402136 [details]:
- slab pages are stable at around 18xx

Ying, can you confirm whether this was fixed by the patches in comment 26, and whether there is still a memory leak in the kernel?
Flags: needinfo?(ying.xu)
From another point of view: LMK vs. OOM.

In the kernel log from attachment 8398866 [details]:
- OOM killed a lot of processes, including vold

In the kernel log from attachment 8402136 [details]:
- OOM killed b2g only twice.

When b2g was killed, some processes were still running, such as Homescreen, Marketplace, Settings, etc.
If LMK were working properly, then when free memory plus cache drops below min_free, some processes should be killed and OOM would not be triggered.

Ying, can you help check whether there is any error in LMK and why OOM is still triggered?
(In reply to Kai-Zhen Li from comment #34)
> From another point of veiw, LMK vs OOM.
> 
> In kernel log from attachment 8398866 [details],
> -OOM kill a lot of process include vold
> 
> In kernel log from attachment 8402136 [details],
> -OOM kill b2g only twice. 
> 
> When b2g's being killed, there are still some processes under running, such
> as Homescreen, Marketplace and setting etc. 
> Suppose if LMK is working properly, when free memory and cache is lower than
> min_free, some processes should be killed and OOM will not be triggered.
> 
> 
> Ying, can you help to check if there is any error in LMK and why OOM is
> still triggered?

Ying, OOM was triggered when free memory and cache were lower than min_free. Could we set the value back to the default to see if the issue still happens?

Author: ying.xu <ying.xu@spreadtrum.com>
Date:   Sat Mar 29 17:20:42 2014 +0800

    Bug#294616 increase proc/sys/vm/min_free_kbytes to 4096 KBytes
    
    [self test   ] pass
    
    Change-Id: Ib7b2bdb52481982c86d63c46bc408fd37f4c567e
Yes.

We're going to revert these settings (vm, oom_adj, KillUnderKB) to their initial values (@MWC),
then use trial and error.
Flags: needinfo?(ying.xu)
James, as discussed today, please have a team member check LMK in the kernel to see whether any policy or algorithm needs refinement. Thanks.
Attached file kernel.log
Ying, I hacked the kernel OOM path to do nothing and return when out_of_memory is called, then ran a monkey test for 8 hours. I found that b2g and Nuwa are never killed by LMK. This is the b2g-info output at the end of the test.

#b2g-info
                           |      megabytes     |
           NAME   PID PPID  CPU(s) NICE  USS  PSS  RSS VSIZE OOM_ADJ USER    
            b2g    83    1 15191.6    0 23.1 27.3 33.8 172.1       0 root    
         (Nuwa)   356   83   254.4    0  0.2  1.2  3.4  77.6       0 root    
 Communications  9976  356  3188.5    0  4.6  8.3 14.1 109.3       1 app_9976
(Preallocated a 20750  356     0.9   18  5.1  7.3 11.6  80.6       2 root
(In reply to Kai-Zhen Li from comment #38)
> Created attachment 8403745 [details]
> kernel.log
> 
> Ying, I had a hack in kernel OOM, do nothing and return when out_of_memory
> is called and then I run monkey test for 8 hours. I found that b2g and nuwa
> will not be killed by LMK. This is b2g-info at the end of the test.
> 
> #b2g-info
>                            |      megabytes     |
>            NAME   PID PPID  CPU(s) NICE  USS  PSS  RSS VSIZE OOM_ADJ USER    
>             b2g    83    1 15191.6    0 23.1 27.3 33.8 172.1       0 root    
>          (Nuwa)   356   83   254.4    0  0.2  1.2  3.4  77.6       0 root    
>  Communications  9976  356  3188.5    0  4.6  8.3 14.1 109.3       1 app_9976
> (Preallocated a 20750  356     0.9   18  5.1  7.3 11.6  80.6       2 root

Can you try image.baidu.com with a userdebug build?
It always causes OOM.
Flags: needinfo?(james.zhang)
We found a small memory leak in the kernel, but it's only 2~3KB. We'll fix it as soon as possible.
(In reply to James Zhang from comment #39)
> (In reply to Kai-Zhen Li from comment #38)
> > Created attachment 8403745 [details]
> > kernel.log
> > 
> > Ying, I had a hack in kernel OOM, do nothing and return when out_of_memory
> > is called and then I run monkey test for 8 hours. I found that b2g and nuwa
> > will not be killed by LMK. This is b2g-info at the end of the test.
> > 
> > #b2g-info
> >                            |      megabytes     |
> >            NAME   PID PPID  CPU(s) NICE  USS  PSS  RSS VSIZE OOM_ADJ USER    
> >             b2g    83    1 15191.6    0 23.1 27.3 33.8 172.1       0 root    
> >          (Nuwa)   356   83   254.4    0  0.2  1.2  3.4  77.6       0 root    
> >  Communications  9976  356  3188.5    0  4.6  8.3 14.1 109.3       1 app_9976
> > (Preallocated a 20750  356     0.9   18  5.1  7.3 11.6  80.6       2 root
> 
> Can you try image.baidu.com with userdebug build?
> It always cause OOM.

Browsing image.baidu.com with the 4/8 build, we did not see b2g killed by OOM.

The current settings are as follows:
1. adb shell cat /proc/sys/vm/min_free_kbytes
1352
2. Low-memory killer parameters:

  notify_trigger 14336 KB

  oom_adj min_free
        0  4096 KB
        1  5120 KB
        2  6144 KB
        6  7168 KB
        8  8192 KB
       10 20480 KB
(In reply to Danny Liang [:dliang] from comment #41)
> 
> Browsing on image.baidu.com with build on 4/8, we don't meet b2g was killed
> by OOM.
> 
> The current settings as below:
> 1. adb shell cat /proc/sys/vm/min_free_kbytes
> 1352

I think it is because 1352 is much lower than 4096, so the kernel's OOM is hard to trigger.
Should the current LMK settings be kept, but the OOM trigger lowered to less than 1024K?
 12 pref("hal.processPriorityManager.gonk.MASTER.KillUnderKB", 1024);
 13 
 14 pref("hal.processPriorityManager.gonk.FOREGROUND_HIGH.KillUnderKB", 2048);
 15 
 16 pref("hal.processPriorityManager.gonk.FOREGROUND.KillUnderKB", 4096);
 17 
 18 pref("hal.processPriorityManager.gonk.BACKGROUND_PERCEIVABLE.KillUnderKB", 8172);
 19 
 20 pref("hal.processPriorityManager.gonk.BACKGROUND_HOMESCREEN.KillUnderKB", 16384);
 21 
 22 pref("hal.processPriorityManager.gonk.BACKGROUND.KillUnderKB", 18432);
(In reply to thomas tsai from comment #43)
> Should current LMK setting be kept, but lower OOM to less than 1024K?
>  12 pref("hal.processPriorityManager.gonk.MASTER.KillUnderKB", 1024);
>  13 
>  14 pref("hal.processPriorityManager.gonk.FOREGROUND_HIGH.KillUnderKB",
> 2048);
>  15 
>  16 pref("hal.processPriorityManager.gonk.FOREGROUND.KillUnderKB", 4096);
>  17 
>  18
> pref("hal.processPriorityManager.gonk.BACKGROUND_PERCEIVABLE.KillUnderKB",
> 8172);
>  19 
>  20
> pref("hal.processPriorityManager.gonk.BACKGROUND_HOMESCREEN.KillUnderKB",
> 16384);
>  21 
>  22 pref("hal.processPriorityManager.gonk.BACKGROUND.KillUnderKB", 18432);

These are the values Mozilla provided before; we have reverted all these changes.
And we won't merge any LMK value change without a lot of test cycles.
Flags: needinfo?(ying.xu)
(In reply to thomas tsai from comment #43)
> Should current LMK setting be kept, but lower OOM to less than 1024K?
>  12 pref("hal.processPriorityManager.gonk.MASTER.KillUnderKB", 1024);
>  14 pref("hal.processPriorityManager.gonk.FOREGROUND_HIGH.KillUnderKB",2048);
>  16 pref("hal.processPriorityManager.gonk.FOREGROUND.KillUnderKB", 4096);

I think we should not lower the OOM trigger value in the kernel; currently min_free_kbytes is 1352KB, which is the kernel default.

We can instead increase the KillUnderKB values, to at least above the OOM trigger.

We want to keep these values:
> pref("hal.processPriorityManager.gonk.BACKGROUND_PERCEIVABLE.KillUnderKB", 8172);
> pref("hal.processPriorityManager.gonk.BACKGROUND_HOMESCREEN.KillUnderKB",16384); 
> pref("hal.processPriorityManager.gonk.BACKGROUND.KillUnderKB", 18432);

And update these values
>  12 pref("hal.processPriorityManager.gonk.MASTER.KillUnderKB", 1024);
>  14 pref("hal.processPriorityManager.gonk.FOREGROUND_HIGH.KillUnderKB",2048);
>  16 pref("hal.processPriorityManager.gonk.FOREGROUND.KillUnderKB", 4096);
pref("hal.processPriorityManager.gonk.MASTER.KillUnderKB", 4096);
pref("hal.processPriorityManager.gonk.FOREGROUND_HIGH.KillUnderKB", 5120);
pref("hal.processPriorityManager.gonk.FOREGROUND.KillUnderKB", 6144);
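The proposal above boils down to an ordering invariant: every KillUnderKB level should sit above the kernel OOM trigger (min_free_kbytes) and be non-decreasing from MASTER down to BACKGROUND, so that Gecko's priority manager always gets to act before the kernel OOM killer. A quick sanity check with the values from this bug; `thresholds_sane` is an illustrative helper, not a Gecko API.

```c
#include <stdbool.h>
#include <stddef.h>

/* KillUnderKB values proposed in this comment (MASTER ..
 * BACKGROUND) and the earlier set they would replace. */
static const long proposed_kb[6] = {4096, 5120, 6144, 8172, 16384, 18432};
static const long earlier_kb[6]  = {1024, 2048, 4096, 8172, 16384, 18432};

/* All thresholds must exceed the kernel OOM trigger and be
 * non-decreasing, so LMK fires before the OOM killer does. */
static bool thresholds_sane(const long *kill_under_kb, size_t n,
                            long min_free_kbytes) {
    for (size_t i = 0; i < n; i++) {
        if (kill_under_kb[i] <= min_free_kbytes) return false;
        if (i > 0 && kill_under_kb[i] < kill_under_kb[i - 1]) return false;
    }
    return true;
}
```

Against the default min_free_kbytes of 1352 both sets pass, but against the patched value of 4096 the earlier MASTER threshold of 1024 sits below the OOM trigger, which matches the failures seen earlier in this bug.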
Flags: needinfo?(ying.xu)
Nuwa was still killed by OOM; we hit this twice in the monkey test.
1. Nuwa was killed by OOM, but the Camera application (oom_adj=2) was still alive.
2. Nuwa was killed by OOM, but Camera (oom_adj=2) and Communications (oom_adj=1) were still alive.

In these two cases, LMK didn't act because free+cache was still higher than its threshold, but OOM was triggered because free memory was only 1360KB.

Low-memory killer parameters:
  notify_trigger 14336 KB
  oom_adj min_free
        0  4096 KB
        1  5120 KB
        2  6144 KB
        6  7168 KB
        8  8192 KB
       10 20480 KB

Moreover, with these LMK settings, the Homescreen (oom_adj=8) in the background won't be killed by LMK, because the threshold for oom_adj=8 is 8192KB, which is hard to reach.
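The failure mode in the two cases above can be stated compactly: the lowmemorykiller judges free plus file cache against its per-adj min_free table, while the allocator only cares about pages that are actually free right now. With plenty of cache that cannot be reclaimed fast enough, LMK stays quiet while the OOM killer fires. A toy model with hypothetical helper names and the numbers from this comment:

```c
#include <stdbool.h>

/* LMK's view: kill when free + file cache drops below the
 * min_free threshold for a given oom_adj level. */
static bool lmk_would_kill(long free_kb, long cache_kb, long minfree_kb) {
    return free_kb + cache_kb < minfree_kb;
}

/* The OOM path's (much simplified) view: trouble when actually
 * free memory alone falls below the watermark-derived floor. */
static bool oom_would_fire(long free_kb, long min_free_kbytes) {
    return free_kb < min_free_kbytes;
}
```

For example, with 1360KB free, 8MB of cache, and the oom_adj=0 threshold of 4096KB, LMK sees over 9MB available and does nothing, yet the allocator is already starved.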
Comment on attachment 8405281 [details]
mtlog_monkey_0411.tar.gz

Attached kernel log and memory information.
I ran a monkey test overnight with the following LMK settings, and b2g was not killed during the test.
---
  oom_adj min_free
        0  1024 KB
        1  2048 KB
        2  4096 KB
        6  6144 KB
        8 16384 KB
       10 20480 KB
---

But with this set of values, background processes will be killed earlier. I think it would be better to also have another test where min_free is 10240KB for oom_adj=8.

BTW, I also think "/proc/sys/vm/min_free_kbytes" should be less than the min_free of oom_adj=0.
Attached file updated log
updated log
I find the oom-killer was invoked mostly by userspace processes,
but LMK was invoked mostly by kswapd, and only sometimes by userspace processes.
If LMK could be invoked when a userspace process allocates memory,
the problem would be gone.
/20140418130229/kernel$ grep -ir 'invoked lowmemorykiller' kernel.log  -c
77

/20140418130229/kernel$ grep -ir 'kswapd0 invoked lowmemorykiller' kernel.log  -c
65

And here is how often the oom-killer happened:
/20140418130229/kernel$ grep -ir 'invoked oom-killer' kernel.log  -c
59

Now there are 2 suspicions:
1. LMK was not called by memory_shrink at all;
2. LMK was called by memory_shrink, but the condition for killing a process was not met.

I will add some logging to check this.
04-18 22:07:55.506 <4>0[ 3933.657617] kswapd0 invoked lowmemorykiller: gfp_mask=0xd0, oom_adj=0, oom_score_adj=0
04-18 22:07:55.506 <4>0[ 3933.657678] Backtrace: 
04-18 22:07:55.506 <4>0[ 3933.657745] [<c4537ad8>] (dump_backtrace+0x0/0x110) from [<c4889f0c>] (dump_stack+0x18/0x1c)
04-18 22:07:55.506 <4>0[ 3933.657813]  r7:c70f3f78 r6:000000d0 r5:c15c57a0 r4:c70f2000
04-18 22:07:55.506 <4>0[ 3933.657901] [<c4889ef4>] (dump_stack+0x0/0x1c) from [<c4779804>] (lowmem_shrink+0x460/0x528)
04-18 22:07:55.506 <4>0[ 3933.657978] [<c47793a4>] (lowmem_shrink+0x0/0x528) from [<c45a45c8>] (shrink_slab+0x114/0x1c0)
04-18 22:07:55.506 <4>0[ 3933.658051] [<c45a44b4>] (shrink_slab+0x0/0x1c0) from [<c45a4bfc>] (kswapd+0x588/0x924)
04-18 22:07:55.506 <4>0[ 3933.658122] [<c45a4674>] (kswapd+0x0/0x924) from [<c456b7d0>] (kthread+0x8c/0x94)
04-18 22:07:55.506 <4>0[ 3933.658192] [<c456b744>] (kthread+0x0/0x94) from [<c45547a4>] (do_exit+0x0/0x5fc)


04-18 22:43:39.183 <4>0[ 6077.330405] Compositor invoked lowmemorykiller: gfp_mask=0x286d2, oom_adj=0, oom_score_adj=0
04-18 22:43:39.183 <4>0[ 6077.330417] Backtrace: 
04-18 22:43:39.183 <4>0[ 6077.330456] [<c4537ad8>] (dump_backtrace+0x0/0x110) from [<c4889f0c>] (dump_stack+0x18/0x1c)
04-18 22:43:39.183 <4>0[ 6077.330472]  r7:c6281d80 r6:000286d2 r5:c59ef440 r4:c6280000
04-18 22:43:39.183 <4>0[ 6077.330510] [<c4889ef4>] (dump_stack+0x0/0x1c) from [<c4779804>] (lowmem_shrink+0x460/0x528)
04-18 22:43:39.183 <4>0[ 6077.330533] [<c47793a4>] (lowmem_shrink+0x0/0x528) from [<c45a45c8>] (shrink_slab+0x114/0x1c0)
04-18 22:43:39.183 <4>0[ 6077.330554] [<c45a44b4>] (shrink_slab+0x0/0x1c0) from [<c45a5180>] (try_to_free_pages+0x1e8/0x368)
04-18 22:43:39.183 <4>0[ 6077.330581] [<c45a4f98>] (try_to_free_pages+0x0/0x368) from [<c459ce98>] (__alloc_pages_nodemask+0x368/0x598)
04-18 22:43:39.183 <4>0[ 6077.330631] [<c459cb30>] (__alloc_pages_nodemask+0x0/0x598) from [<bf003ed4>] (os_allocate+0x11c/0x38c [ump])
04-18 22:43:39.183 <4>0[ 6077.330677] [<bf003db8>] (os_allocate+0x0/0x38c [ump]) from [<bf002b08>] (_ump_ukk_allocate+0x1f4/0x380 [ump])
04-18 22:43:39.183 <4>0[ 6077.330727] [<bf002914>] (_ump_ukk_allocate+0x0/0x380 [ump]) from [<bf0057cc>] (ump_secure_id_from_phys_blocks_wrapper+0x308/0x450 [ump])
04-18 22:43:39.183 <4>0[ 6077.330807] [<bf0056e0>] (ump_secure_id_from_phys_blocks_wrapper+0x21c/0x450 [ump]) from [<bf003754>] (ump_file_ioctl+0x11c/0x238 [ump])
04-18 22:43:39.183 <4>0[ 6077.330824]  r8:c4534244 r7:0000000c r6:c0109002 r5:45bff908 r4:45bff908
04-18 22:43:39.183 <4>0[ 6077.330871] [<bf003638>] (ump_file_ioctl+0x0/0x238 [ump]) from [<c45d7eb0>] (do_vfs_ioctl+0x540/0x5c4)
04-18 22:49:10.445 <4>0[ 6408.596004] Compositor invoked oom-killer: gfp_mask=0x286d2, order=0, oom_adj=0, oom_score_adj=0
04-18 22:49:10.445 <4>0[ 6408.596091] Backtrace: 
04-18 22:49:10.445 <4>0[ 6408.596153] [<c4537ad8>] (dump_backtrace+0x0/0x110) from [<c4889f0c>] (dump_stack+0x18/0x1c)
04-18 22:49:10.445 <4>0[ 6408.596272]  r7:000286d2 r6:000286d2 r5:00000000 r4:c6280000
04-18 22:49:10.445 <4>0[ 6408.596352] [<c4889ef4>] (dump_stack+0x0/0x1c) from [<c4599464>] (dump_header.clone.1+0x7c/0xac)
04-18 22:49:10.445 <4>0[ 6408.596428] [<c45993e8>] (dump_header.clone.1+0x0/0xac) from [<c45994d4>] (oom_kill_process.clone.0+0x40/0x26c)
04-18 22:49:10.445 <4>0[ 6408.596503]  r6:00000000 r5:00000154 r4:c59ef7a0
04-18 22:49:10.445 <4>0[ 6408.596565] [<c4599494>] (oom_kill_process.clone.0+0x0/0x26c) from [<c45999b4>] (out_of_memory+0x2b4/0x3a0)
04-18 22:49:10.445 <4>0[ 6408.596649] [<c4599700>] (out_of_memory+0x0/0x3a0) from [<c459cfb4>] (__alloc_pages_nodemask+0x484/0x598)
04-18 22:49:10.445 <4>0[ 6408.596757] [<c459cb30>] (__alloc_pages_nodemask+0x0/0x598) from [<bf003ed4>] (os_allocate+0x11c/0x38c [ump])
04-18 22:49:10.445 <4>0[ 6408.596863] [<bf003db8>] (os_allocate+0x0/0x38c [ump]) from [<bf002b08>] (_ump_ukk_allocate+0x1f4/0x380 [ump])
04-18 22:49:10.445 <4>0[ 6408.596970] [<bf002914>] (_ump_ukk_allocate+0x0/0x380 [ump]) from [<bf0057cc>] (ump_secure_id_from_phys_blocks_wrapper+0x308/0x450 [ump])
04-18 22:49:10.445 <4>0[ 6408.597094] [<bf0056e0>] (ump_secure_id_from_phys_blocks_wrapper+0x21c/0x450 [ump]) from [<bf003754>] (ump_file_ioctl+0x11c/0x238 [ump])
04-18 22:49:10.445 <4>0[ 6408.597183]  r8:c4534244 r7:0000000c r6:c0109002 r5:45bff908 r4:45bff908
04-18 22:49:10.445 <4>0[ 6408.597287] [<bf003638>] (ump_file_ioctl+0x0/0x238 [ump]) from [<c45d7eb0>] (do_vfs_ioctl+0x540/0x5c4)
We should dig into the code of __alloc_pages_nodemask(), which leads to the two code branches LMK and oom-killer.
LMK had already been called before the oom-killer was invoked,
but it returned early because this check on sc->nr_to_scan <= 0 was hit:

	if (sc->nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
		lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
			     sc->nr_to_scan, sc->gfp_mask, rem);
		return rem;
	}

<4>0[  144.806796] lowmem_shrink 0, 200da, return 15851



<4>0[  144.806742] lowmem_shrink swapfree 5699, min_swapfree 22796, min_adj 16 
<4>0[  144.806796] lowmem_shrink 0, 200da, return 15851
<4>0[  144.806860] lowmem_shrink swapfree 5699, min_swapfree 22796, min_adj 16 
<4>0[  144.806915] lowmem_shrink 0, 200da, return 15851
<4>0[  144.806989] lowmem_shrink swapfree 5699, min_swapfree 22796, min_adj 16 
<4>0[  144.807044] lowmem_shrink 0, 200da, return 15851
<4>0[  144.807166] lowmem_shrink swapfree 5699, min_swapfree 22796, min_adj 16 
<4>0[  144.807221] lowmem_shrink 0, 200da, return 15851
<4>0[  144.807413] lowmem_shrink swapfree 5699, min_swapfree 22796, min_adj 16 
<4>0[  144.807469] lowmem_shrink 0, 200da, return 15851
<4>0[  144.807825] (Preallocated a invoked oom-killer: gfp_mask=0x200da, order=0, oom_adj=1, oom_score_adj=67
<4>0[  144.807897] Backtrace: 
<4>0[  144.807953] [<c4537ad8>] (dump_backtrace+0x0/0x110) from [<c4889f4c>] (dump_stack+0x18/0x1c)
<4>0[  144.808019]  r7:000200da r6:000200da r5:00000000 r4:c6f6e000
<4>0[  144.808096] [<c4889f34>] (dump_stack+0x0/0x1c) from [<c4599464>] (dump_header.clone.1+0x7c/0xac)
<4>0[  144.808173] [<c45993e8>] (dump_header.clone.1+0x0/0xac) from [<c45994d4>] (oom_kill_process.clone.0+0x40/0x26c)
<4>0[  144.808247]  r6:00000000 r5:00000290 r4:c2f40360
<4>0[  144.808309] [<c4599494>] (oom_kill_process.clone.0+0x0/0x26c) from [<c45999b4>] (out_of_memory+0x2b4/0x3a0)
<4>0[  144.808393] [<c4599700>] (out_of_memory+0x0/0x3a0) from [<c459cff4>] (__alloc_pages_nodemask+0x484/0x598)
<4>0[  144.808477] [<c459cb70>] (__alloc_pages_nodemask+0x0/0x598) from [<c45bc248>] (read_swap_cache_async+0x58/0x1b0)
<4>0[  144.808562] [<c45bc1f0>] (read_swap_cache_async+0x0/0x1b0) from [<c45bc42c>] (swapin_readahead+0x8c/0x94)
<4>0[  144.808646] [<c45bc3a0>] (swapin_readahead+0x0/0x94) from [<c45ae800>] (handle_pte_fault+0x2bc/0x6cc)
<4>0[  144.808726] [<c45ae544>] (handle_pte_fault+0x0/0x6cc) from [<c45af184>] (handle_mm_fault+0xd0/0xe4)
<4>0[  144.808806] [<c45af0b4>] (handle_mm_fault+0x0/0xe4) from [<c453b808>] (do_page_fault+0xe8/0x28c)
<4>0[  144.808888] [<c453b720>] (do_page_fault+0x0/0x28c) from [<c4528288>] (do_DataAbort+0x3c/0xa0)
<4>0[  144.808968] [<c452824c>] (do_DataAbort+0x0/0xa0) from [<c4533dcc>] (ret_from_exception+0x0/0x10)
(In reply to ying.xu from comment #55)
> LMK has been called before involving oom-killer.
> But returned due to this code, here : sc->nr_to_scan <= 0, checked
> 
> 	if (sc->nr_to_scan <= 0 || min_adj == OOM_ADJUST_MAX + 1) {
> 		lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
> 			     sc->nr_to_scan, sc->gfp_mask, rem);
> 		return rem;
> 	}
> 
> <4>0[  144.806796] lowmem_shrink 0, 200da, return 15851

From looking into the code, this is normal.  |try_to_free_pages()| calls a memory shrinker with 0 as the value of the nr_to_scan argument to ask how many pages could potentially be freed.
(In reply to Thinker Li [:sinker] from comment #56)

> For looking into the code, this is normal.  |try_to_free_pages()| call
> memory shrinker with 0 as the value of the nr_to_scan argument to return how
> many pages are potentially to free.

Yes.
But inside the function, nr_pages_scanned is checked and updated: if it is 0, it becomes SWAP_CLUSTER_MAX (32).

unsigned long shrink_slab(struct shrink_control *shrink,
			  unsigned long nr_pages_scanned,
			  unsigned long lru_pages)
{
	struct shrinker *shrinker;
	unsigned long ret = 0;

	if (nr_pages_scanned == 0)
		nr_pages_scanned = SWAP_CLUSTER_MAX;
And I think the check condition of the while loop may fail,
so do_shrinker_shrink() with nr_to_scan != 0 may never be called:

		while (total_scan >= SHRINK_BATCH) {
			long this_scan = SHRINK_BATCH;
			int shrink_ret;
			int nr_before;

			nr_before = do_shrinker_shrink(shrinker, shrink, 0);
			shrink_ret = do_shrinker_shrink(shrinker, shrink,
							this_scan);
			if (shrink_ret == -1)
				break;
			if (shrink_ret < nr_before)
				ret += nr_before - shrink_ret;
			count_vm_events(SLABS_SCANNED, this_scan);
			total_scan -= this_scan;

			cond_resched();
		}
LMK returns a big number for |max_pass| in the same function, unless there is no killable process at all.  So it is very unlikely that the second call of |do_shrinker_shrink()| is skipped while there is still something LMK could kill.  Maybe we should monitor this value and add a log entry for it to make sure it is running correctly.
(In reply to ying.xu from comment #53)
> (ump_secure_id_from_phys_blocks_wrapper+0x21c/0x450 [ump]) from [<bf003754>]
> (ump_file_ioctl+0x11c/0x238 [ump])
> 04-18 22:49:10.445 <4>0[ 6408.597183]  r8:c4534244 r7:0000000c r6:c0109002
> r5:45bff908 r4:45bff908
> 04-18 22:49:10.445 <4>0[ 6408.597287] [<bf003638>] (ump_file_ioctl+0x0/0x238
> [ump]) from [<c45d7eb0>] (do_vfs_ioctl+0x540/0x5c4)

Some drivers request free pages from the kernel with the memory shrinker prohibited.  That means once free memory is used up, OOM is invoked without the memory shrinker ever being tried.  I guess ump is one of them.
Forget my previous comment; I have found another path to the memory shrinker.
(In reply to ying.xu from comment #50)
> Created attachment 8409282 [details]
> updated log

Your log is very noisy; there are too many useless files in the tarball, and I don't know which file is actually useful.  Could you attach only the useful files next time?  Thanks!
(In reply to Thinker Li [:sinker] from comment #62)
> Your log is very noisy, too much useless files in the tarball.  I don't know
> which file is really useful.  Could you attach only useful files here next
> time?  Thanks!

Sorry.
Please refer to the path slog_external/20140418130229 in the packed file.

slog_external/20140418130229$ du -ch
13M	./kernel    //kernel log
70M	./android   //android log
700K	./misc
Hi Solomon, can you analyse the Compositor? Thanks!
Flags: needinfo?(schiu)
I cannot find any abnormal messages in the log, except for one unusual thing about gralloc:

04-18 21:31:18.370    89  2097 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-18 21:31:18.460    89  2097 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-18 21:45:30.270    89  3095 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-18 21:45:30.310    89  3095 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-18 22:35:02.870    89  6424 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-18 22:35:02.870    89  6424 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-18 23:49:47.550 10869 12089 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-18 23:49:47.550 10869 12089 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-18 23:49:47.550 10869 12089 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 01:06:19.660 10869 16611 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 01:06:19.660 10869 16611 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 01:06:19.710 10869 16611 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 02:12:53.550 10869 19054 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 02:12:53.560 10869 19054 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 02:49:42.400 10869 20380 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 02:49:42.400 10869 20380 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 02:49:42.420 10869 20380 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 04:35:10.380 10869 24643 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 04:35:10.410 10869 24643 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 06:12:16.650 10869 28589 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 06:12:16.650 10869 28589 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 06:12:16.670 10869 28589 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 06:30:03.290 10869 29298 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 06:30:03.310 10869 29298 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 07:37:03.130 10869 31883 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 07:37:03.130 10869 31883 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 07:37:03.180 10869 31883 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 07:57:48.000 10869 32702 E SprdCameraHardware: Could not dequeue gralloc buffer!
04-19 07:57:48.050 10869 32702 E SprdCameraHardware: Could not dequeue gralloc buffer!
Flags: needinfo?(schiu)
Hi All,
After discussing with Yiwen, I have listed some points below:
1. b2g/Nuwa were killed by the OOM killer. In these cases there were still applications in the background/foreground that should have been killed first.
2. slog only recorded the log up to 18764 secs, but the sysdump shows the system was running until 138261 secs. Was slog killed by something?
3. From the ps status dump from Yiwen, adbd appears to have restarted.
4. There are a lot of camera errors in the system dump, even though free memory is more than 9 MB.

Below are the action items for LMK/OOM:
1. Change the LMK threshold for the homescreen to 16 MB to make sure the homescreen can be killed in the background.
2. Add a mechanism that kills an application between LMK and the OOM killer.
3. Optimize the seeks value in mm/vmscan.c so that LMK is not skipped.
4. Make the calculation of the file cache more accurate.

Hi James,
Could you have a team member check the camera errors in the sysdump?

OOM log:
04-19 23:08:47.948 <6>0[12519.033855] R (running)        14438      76796K    18752K      17600K    12631K    10004K    12964K Messages
04-19 23:08:47.948 <6>0[12519.035123] R (running)        14544      65320K    16752K      16748K    11524K     8656K     3684K E-Mail
...
04-19 23:08:47.948 <3>0[12519.035678] Out of memory: Kill process 629 (b2g) score 334 or sacrifice child
04-19 23:08:47.948 <3>0[12519.035713] Killed process 6229 ((Nuwa)) total-vm:51688kB, anon-rss:64kB, file-rss:788kB

04-19 23:23:51.700 <6>0[13422.718145] R (running)        15591      80088K    13452K      12876K     8077K     6488K    12936K Messages
04-19 23:23:51.700 <6>0[13422.718254] S (sleeping)       15608       1804K      116K        112K        6K        0K      148K hciattach
04-19 23:23:51.700 <6>0[13422.718362] S (sleeping)       15614       2140K      120K        120K        8K        0K      264K bluetoothd
04-19 23:23:51.700 <6>0[13422.719589] D (disk sleep)     15660      65920K    15496K      14920K     9745K     7792K     5096K Communications
...
04-19 23:23:51.700 <3>0[13422.720474] Out of memory: Kill process 629 (b2g) score 345 or sacrifice child
04-19 23:23:51.700 <3>0[13422.720560] Killed process 14736 ((Nuwa)) total-vm:51688kB, anon-rss:352kB, file-rss:2148kB

04-19 23:25:34.099 <6>0[13525.024321] R (running)        15724      72708K    15032K      13880K    10497K     8672K    10660K Homescreen
...
04-19 23:25:34.099 <3>0[13525.026461] Out of memory: Kill process 629 (b2g) score 341 or sacrifice child
04-19 23:25:34.099 <3>0[13525.026540] Killed process 15840 ((Nuwa)) total-vm:51688kB, anon-rss:0kB, file-rss:1816kB

04-19 23:25:41.321 <6>0[13532.335203] R (running)        15362      58852K     4584K       4580K     2880K     2320K     6860K Browser
04-19 23:25:41.321 <6>0[13532.336936] R (running)        15724      72772K    16948K      15796K    12338K    10312K     9120K Homescreen
...
04-19 23:25:41.321 <3>0[13532.338404] Out of memory: Kill process 629 (b2g) score 343 or sacrifice child
04-19 23:25:41.321 <3>0[13532.338477] Killed process 629 (b2g) total-vm:169604kB, anon-rss:29064kB, file-rss:7136kB

sysdump:
0[138258.814601] V4L2: fail: set_next_buffer filled buffer is 0.
0[138258.890422] V4L2: path1_done_buffer: Fail to this entry. last addr: 7a25200, buf addr: 7a25200
0[138258.895825] dcam_cb_ISRCapSOF.
0[138258.895873] V4L2: fail: set_next_buffer filled buffer is 0.
0[138258.921666] 116U
0[138258.971676] V4L2: path1_done_buffer: Fail to this entry. last addr: 7a25200, buf addr: 7a25200
0[138258.977113] dcam_cb_ISRCapSOF.
0[138258.977159] V4L2: fail: set_next_buffer filled buffer is 0.
0[138259.010461] sprd_debug_check_crash_key: Crash key count : 2
0[138259.010527] Kernel panic - not syncing: Crash Key
0[138259.010568] Mem-info:
0[138259.010593] Normal per-cpu:
0[138259.010623] CPU    0: hi:   42, btch:   7 usd:  12
0[138259.010675] active_anon:4794 inactive_anon:4880 isolated_anon:0
0[138259.010683]  active_file:621 inactive_file:574 isolated_file:0
0[138259.010691]  unevictable:83 dirty:0 writeback:0 unstable:0
0[138259.010698]  free:2421 slab_reclaimable:326 slab_unreclaimable:1166
0[138259.010706]  mapped:734 shmem:3392 pagetables:457 bounce:0
0[138259.010915] Normal free:9684kB min:512kB low:1152kB high:1280kB active_anon:19176kB inactive_anon:19520kB active_file:2484kB inactive_file:2296kB unevictable:332kB isolated(anon):0kB isolated(file):0kB present:114408kB mlocked:0kB dirty:0kB writeback:0kB mapped:2936kB shmem:13568kB slab_reclaimable:1304kB slab_unreclaimable:4664kB kernel_stack:2048kB pagetables:1828kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
0[138259.011179] lowmem_reserve[]: 0 0 0
0[138259.011225] Normal: 2217*4kB 98*8kB 2*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 9684kB
0[138259.011356] 4969 total pagecache pages
0[138259.011389] 289 pages in swap cache
0[138259.011423] Swap cache stats: add 2317039, delete 2316750, find 128810/1277541
0[138259.011479] Free swap  = 0kB
0[138259.011506] Total swap = 65532kB
0[138259.012739] 28854 pages of RAM
0[138259.012770] 3055 free pages
0[138259.012797] 3771 reserved pages
0[138259.012826] 1109 slab pages
0[138259.012852] 3161 pages shared
0[138259.012880] 289 pages swap cached
0[138260.020092]
0[138260.020124] *****************************************************
0[138260.020173] *                                                   *
0[138260.020221] *  Sysdump enter, preparing debug info to dump ...  *
0[138260.020269] *                                                   *
0[138260.020316] *****************************************************
0[138260.020363]
0[138261.029526] vaddr is c7f00000,paddr is 07f00000
0[138261.029588] reason: Crash Key, sprd_sysdump_info->crash_key: 1
Flags: needinfo?(james.zhang)
kernel and system dump
Whiteboard: [c=memory p= s= u=tarako] [POVB][priority] → [c=memory p= s= u=tarako] [POVB][priority][wait for sprd kernel patch]
Danny, please discuss camera error log with dafeng.xu. His seat is near Ying.
Flags: needinfo?(james.zhang) → needinfo?(Dafeng.Xu)
(In reply to Danny Liang [:dliang] from comment #66)

> Below are the action items of LMK/OOM.
> 1. Change LMK threshold of homescreen to 16MB to make sure homescreen can be
> killed at background.
done

> 2. Add the mechanism to kill application between LMK and OOM.
WIP-patch and log

> 3. Optimize the seek value in mm/vmscan.c in order to avoid LMK didn't work.
done

> 4. Make calculation of file cache more accurate
No conclusion yet. We don't think there is an easy way to calculate this.
(In reply to ying.xu from comment #70)
> Created attachment 8410738 [details] [diff] [review]
> use_lmk_strategy_in_oom.patch

I have run for 12 hours with this patch.

And I can see the OOM path has fallen into kill_bad_process_lmk:

04-23 02:39:16.621 <3>0[19321.450230] vmscan shrink_slab: bf02c80c nr=0 total_scan=7
04-23 02:39:16.621 <3>0[19321.450348] oom_killer send sigkill to 20216 (Homescreen), adj 8, size 4885
04-23 02:39:16.621 <4>0[19321.450405] b2g invoked oom-killer: gfp_mask=0x0, order=-1, oom_adj=0, oom_score_adj=0
04-23 02:39:16.621 <4>0[19321.450465] Backtrace:
04-23 02:39:16.621 <4>0[19321.450525] [<c4537ad8>] (dump_backtrace+0x0/0x110) from [<c488a11c>] (dump_stack+0x18/0x1c)
04-23 02:39:16.621 <4>0[19321.450591]  r7:c49c3c48 r6:00000000 r5:ffffffff r4:c1bbc000
04-23 02:39:16.621 <4>0[19321.450669] [<c488a104>] (dump_stack+0x0/0x1c) from [<c4599464>] (dump_header.clone.1+0x7c/0xac)
04-23 02:39:16.621 <4>0[19321.450745] [<c45993e8>] (dump_header.clone.1+0x0/0xac) from [<c4599830>] (kill_bad_process_lmk+0x130/0x158)
04-23 02:39:16.621 <4>0[19321.450892]  r6:00000008 r5:c65130e0 r4:00001315
04-23 02:39:16.621 <4>0[19321.450958] [<c4599700>] (kill_bad_process_lmk+0x0/0x158) from [<c45999d4>] (out_of_memory+0x17c/0x3b4)
04-23 02:39:16.621 <4>0[19321.451040] [<c4599858>] (out_of_memory+0x0/0x3b4) from [<c459d160>] (__alloc_pages_nodemask+0x484/0x598)
04-23 02:39:16.621 <4>0[19321.451127] [<c459ccdc>] (__alloc_pages_nodemask+0x0/0x598) from [<c45bc3d4>] (read_swap_cache_async+0x58/0x1b0)
04-23 02:39:16.621 <4>0[19321.451212] [<c45bc37c>] (read_swap_cache_async+0x0/0x1b0) from [<c45bc5b8>] (swapin_readahead+0x8c/0x94)
04-23 02:39:16.621 <4>0[19321.451299] [<c45bc52c>] (swapin_readahead+0x0/0x94) from [<c45ae98c>] (handle_pte_fault+0x2bc/0x6cc)
04-23 02:39:16.621 <4>0[19321.451379] [<c45ae6d0>] (handle_pte_fault+0x0/0x6cc) from [<c45af310>] (handle_mm_fault+0xd0/0xe4)
04-23 02:39:16.621 <4>0[19321.451460] [<c45af240>] (handle_mm_fault+0x0/0xe4) from [<c453b808>] (do_page_fault+0xe8/0x28c)
04-23 02:39:16.621 <4>0[19321.451547] [<c453b720>] (do_page_fault+0x0/0x28c) from [<c4528288>] (do_DataAbort+0x3c/0xa0)
04-23 02:39:16.621 <4>0[19321.451629] [<c452824c>] (do_DataAbort+0x0/0xa0) from [<c4533dcc>] (ret_from_exception+0x0/0x10)
04-23 02:39:16.621 <4>0[19321.451698] Exception stack(0xc1bbdfb0 to 0xc1bbdff8)
04-23 02:39:16.621 <4>0[19321.451745] dfa0:                                     434e31e0 44aca000 44aaa110 44aaa104
04-23 02:39:16.621 <4>0[19321.451815] dfc0: 434e31e0 00000001 bec784d0 46f9e050 46f9e040 00000002 00000001 46ed8430
04-23 02:39:16.621 <4>0[19321.451883] dfe0: 00000001 bec78458 41b6d531 41b6d2d6 20800030 ffffffff
04-23 02:39:16.621 <4>0[19321.451935]  r7:46f9e050 r6:bec784d0 r5:00000007 r4:0000040f
04-23 02:39:16.621 <4>0[19321.452001] Mem-info:
04-23 02:39:16.621 <4>0[19321.452026] Normal per-cpu:
04-23 02:39:16.621 <4>0[19321.452055] CPU    0: hi:   42, btch:   7 usd:   0
04-23 02:39:16.621 <4>0[19321.452107] active_anon:6858 inactive_anon:6888 isolated_anon:0
04-23 02:39:16.621 <4>0[19321.452115]  active_file:2108 inactive_file:2 isolated_file:6
04-23 02:39:16.621 <4>0[19321.452122]  unevictable:83 dirty:14 writeback:0 unstable:0
04-23 02:39:16.621 <4>0[19321.452130]  free:340 slab_reclaimable:263 slab_unreclaimable:1042
04-23 02:39:16.621 <4>0[19321.452138]  mapped:3061 shmem:1014 pagetables:439 bounce:0
04-23 02:39:16.646 ated(anon):0kB isolated(file):24kB present:114408kB mlocked:0kB dirty:56kB writeback:0kB mapped:12244kB shmem:4056kB slab_reclaimable:1052kB slab_unreclaimable:4168kB kernel_stack:2064kB pagetables:1756kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:596 all_unreclaimable? no
04-23 02:39:16.646 <4>0[19321.452609] lowmem_reserve[]: 0 0 0
04-23 02:39:16.646 <4>0[19321.452655] Normal: 34*4kB 1*8kB 0*16kB 4*32kB 9*64kB 4*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1360kB
04-23 02:39:16.646 <4>0[19321.452786] 3573 total pagecache pages
04-23 02:39:16.646 <4>0[19321.452819] 360 pages in swap cache
04-23 02:39:16.646 <4>0[19321.452853] Swap cache stats: add 3087750, delete 3087390, find 108563/1658455
04-23 02:39:16.646 <4>0[19321.452908] Free swap  = 27232kB
04-23 02:39:16.646 <4>0[19321.452937] Total swap = 65532kB
04-23 02:39:16.646 <4>0[19321.454546] 28854 pages of RAM
04-23 02:39:16.646 <4>0[19321.454579] 961 free pages
04-23 02:39:16.646 <4>0[19321.454606] 3755 reserved pages
04-23 02:39:16.646 <4>0[19321.454634] 940 slab pages
04-23 02:39:16.646 <4>0[19321.454660] 8694 pages shared
04-23 02:39:16.646 <4>0[19321.454687] 360 pages swap cached
04-23 02:39:16.646 <4>0[19321.454725] zram0 status unit(page):
04-23 02:39:16.646 <4>0[19321.454731]   mem_used_total:  3777
04-23 02:39:16.646 <4>0[19321.454737]   compr_data_size: 3523
04-23 02:39:16.646 <4>0[19321.454742]   orig_data_size:  9400
04-23 02:39:16.646 <4>0[19321.454748]   num_reads:       1549867
04-23 02:39:16.646 <4>0[19321.454754]   num_writes:      1536310
04-23 02:39:16.646 <6>0[19321.454908] stat                 pid         vss       rss  rss_mapped       pss       uss      swap
04-23 02:39:16.646 <6>0[19321.454975] -----------------------------------------------------------------
04-23 02:39:16.646 <6>0[19321.455047] S (sleeping)           1        376K      108K        108K       45K        0K      124K init
04-23 02:39:16.646 <6>0[19321.455139] S (sleeping)          65        312K       88K         88K       44K        0K       92K ueventd
04-23 02:39:16.646 <6>0[19321.455235] S (sleeping)          80        828K       92K         88K        7K        4K       84K servicemanager
04-23 02:39:16.646 <6>0[19321.455357] S (sleeping)          81       4020K       96K         96K        3K        0K      264K vold
04-23 02:39:16.646 <6>0[19321.455438] S (sleeping)          82        248K        0K          0K        0K        0K       20K vcharged
04-23 02:39:16.646 <6>0[19321.455528] S (sleeping)          83        680K       68K         64K        2K        0K       72K sprd_monitor
04-23 02:39:16.646 <6>0[19321.455623] S (sleeping)          87        820K       92K         88K        3K        0K       92K rilproxy
04-23 02:39:16.646 <6>0[19321.455768] S (sleeping)          89       7420K      112K        112K        4K        0K      288K netd
04-23 02:39:16.646 <6>0[19321.455856] S (sleeping)          90        692K       68K         64K        2K        0K       72K debuggerd
04-23 02:39:16.646 <6>0[19321.455954] S (sleeping)          92       1364K      104K        100K        3K        0K      148K dbus-daemon
04-23 02:39:16.646 <6>0[19321.456054] S (sleeping)          93       1732K       60K         56K        1K        0K      160K keystore
04-23 02:39:16.646 <6>0[19321.456152] S (sleeping)          94       1276K       84K         80K        2K        0K      176K nvm_daemon
04-23 02:39:16.646 <6>0[19321.456253] S (sleeping)          95       1844K       72K         72K        2K        0K       96K modemd
04-23 02:39:16.646 <6>0[19321.456344] S (sleeping)         101        820K       92K         88K        3K        0K       92K rilproxy
04-23 02:39:16.646 <6>0[19321.456439] S (sleeping)         104       1112K       68K         64K        2K        0K      116K engmoded
04-23 02:39:16.646 <6>0[19321.456555] S (sleeping)         112       4464K      132K        132K      128K      128K       68K adbd
04-23 02:39:16.646 <6>0[19321.456852] S (sleeping)         155      28716K       88K         88K        3K        0K      460K phoneserver_2si
04-23 02:39:16.646 <6>0[19321.457008] S (sleeping)         158       8412K       84K         84K        2K        0K      424K engpcclient
04-23 02:39:16.646 <6>0[19321.457112] S (sleeping)         160       1872K      100K        100K        4K        0K      116K engmodemclient
04-23 02:39:16.646 <6>0[19321.457215] S (sleeping)         165       1856K       88K         88K        3K        0K      100K engservice
04-23 02:39:16.646 <6>0[19321.457351] S (sleeping)         319       5304K      240K        240K      115K      108K       88K slog
04-23 02:39:16.646 <6>0[19321.457447] R (running)        19595        764K      232K        228K      103K       60K       28K oom-msg-logger
04-23 02:39:16.680 tach
04-23 02:39:16.680 <6>0[19321.457658] S (sleeping)       19895       2144K      132K        132K        7K        0K      268K bluetoothd
04-23 02:39:16.680 <6>0[19321.457751] S (sleeping)       19994        756K      172K        168K       43K        0K       76K sh
04-23 02:39:16.680 <6>0[19321.457840] S (sleeping)       19996        688K      128K        128K       39K       36K       52K orng
04-23 02:39:16.680 <6>0[19321.457989] S (sleeping)       20139       7628K      196K        196K       17K        4K      240K rild_sp
04-23 02:39:16.680 <6>0[19321.458151] S (sleeping)       20140       9676K      200K        200K       17K        4K      260K rild_sp
04-23 02:39:16.680 <6>0[19321.458409] S (sleeping)       20149      20444K      264K        264K       31K        4K     1168K mediaserver
04-23 02:39:16.680 <6>0[19321.460931] R (running)        20150     141124K    33132K      29880K    25160K    23360K    14868K b2g
04-23 02:39:16.680 <6>0[19321.461559] S (sleeping)       20202      51684K     2968K       2968K      582K        0K     8996K (Nuwa)
04-23 02:39:16.680 <6>0[19321.462579] R (running)        20216      66160K    10468K      10468K     6058K     4652K     9072K Homescreen
04-23 02:39:16.680 <6>0[19321.462697] S (sleeping)       20252       2504K      204K        204K       50K       40K      244K wpa_supplicant
04-23 02:39:16.680 <6>0[19321.464842] S (sleeping)       20254      96676K    33980K      31384K    26265K    24224K     5664K Camera
04-23 02:39:16.680 <6>0[19321.465758] R (running)        20349      59868K     8404K       8400K     4402K     3172K     5672K (Preallocated a
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → 1.4 S6 (25apr)
You need to log in before you can comment on or make changes to this bug.