Closed Bug 1005799 Opened 5 years ago Closed 2 years ago

[Tarako][monkey test] b2g was killed by LMK (fail rate: 1/80hrs)

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: dliang, Unassigned)

References

Details

Attachments

(5 files)

Attached file system and kernel log
With new LMK parameters and OOM mechanism, we still met b2g was killed by LMK due to free swap is only 440KB in monkey test. The fail rate is low (1/80hrs) and it can be recovered by b2g restart.
There are something I found in log:
1. Communcation was killed w/ oom_adj=1 just before b2g was killed.
2. b2g has bigger RSS in this case.

kernel.log:
05-05 00:39:01.453 <4>0[14336.605437] lowmem_shrink select 844 (b2g), adj 0, size 27064, to kill
05-05 00:39:01.453 <4>0[14336.605512] lowmem_shrink send sigkill to 844 (b2g), adj 0, size 27064
05-05 00:39:01.453 <4>0[14336.605596] kswapd0 invoked lowmemorykiller: gfp_mask=0xd0, oom_adj=0, oom_score_adj=0
05-05 00:39:01.453 <4>0[14336.605657] Backtrace:
05-05 00:39:01.453 <4>0[14336.605711] [<c4537ad8>] (dump_backtrace+0x0/0x110) from [<c4889d8c>] (dump_stack+0x18/0x1c)
05-05 00:39:01.453 <4>0[14336.605778]  r7:c70f5f78 r6:000000d0 r5:c66977a0 r4:c70f4000
05-05 00:39:01.453 <4>0[14336.605860] [<c4889d74>] (dump_stack+0x0/0x1c) from [<c4779678>] (lowmem_shrink+0x48c/0x560)
05-05 00:39:01.453 <4>0[14336.605935] [<c47791ec>] (lowmem_shrink+0x0/0x560) from [<c45a4944>] (shrink_slab+0x114/0x1c0)
05-05 00:39:01.453 <4>0[14336.606008] [<c45a4830>] (shrink_slab+0x0/0x1c0) from [<c45a4f78>] (kswapd+0x588/0x924)
05-05 00:39:01.453 <4>0[14336.606079] [<c45a49f0>] (kswapd+0x0/0x924) from [<c456b7d0>] (kthread+0x8c/0x94)
05-05 00:39:01.453 <4>0[14336.606146] [<c456b744>] (kthread+0x0/0x94) from [<c45547a4>] (do_exit+0x0/0x5fc)
05-05 00:39:01.453 <4>0[14336.606204]  r7:00800013 r6:c45547a4 r5:c456b744 r4:c7029f20
05-05 00:39:01.453 <4>0[14336.606268] Mem-info:
05-05 00:39:01.453 <4>0[14336.606293] Normal per-cpu:
05-05 00:39:01.453 <4>0[14336.606322] CPU    0: hi:   42, btch:   7 usd:  36
05-05 00:39:01.453 <4>0[14336.606374] active_anon:6882 inactive_anon:6913 isolated_anon:0
05-05 00:39:01.453 <4>0[14336.606382]  active_file:492 inactive_file:210 isolated_file:0
05-05 00:39:01.453 <4>0[14336.606390]  unevictable:83 dirty:4 writeback:0 unstable:0
05-05 00:39:01.453 <4>0[14336.606397]  free:687 slab_reclaimable:262 slab_unreclaimable:1100
05-05 00:39:01.453 <4>0[14336.606405]  mapped:598 shmem:278 pagetables:340 bounce:0
05-05 00:39:01.453 <4>0[14336.606612] Normal free:2748kB min:1352kB low:2200kB high:2540kB active_anon:27528kB inactive_anon:27652kB active_file:1968kB inactive_file:840kB unevictable:332kB isolated(anon):0kB isolated(file):0kB present:114408kB mlocked:0kB dirty:16kB writeback:0kB mapped:2392kB shmem:1112kB slab_reclaimable:1048kB slab_unreclaimable:4400kB kernel_stack:1808kB pagetables:1360kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:3 all_unreclaimable? no
05-05 00:39:01.453 <4>0[14336.606875] lowmem_reserve[]: 0 0 0
05-05 00:39:01.453 <4>0[14336.606921] Normal: 435*4kB 68*8kB 15*16kB 7*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2748kB
05-05 00:39:01.453 <4>0[14336.607054] 3286 total pagecache pages
05-05 00:39:01.453 <4>0[14336.607088] 175 pages in swap cache
05-05 00:39:01.453 <4>0[14336.607122] Swap cache stats: add 3787894, delete 3787719, find 118844/1968152
05-05 00:39:01.453 <4>0[14336.607177] Free swap  = 440kB
05-05 00:39:01.453 <4>0[14336.607206] Total swap = 65532kB
05-05 00:39:01.453 <4>0[14336.608563] 28854 pages of RAM
05-05 00:39:01.453 <4>0[14336.608597] 1294 free pages
05-05 00:39:01.453 <4>0[14336.608624] 1707 reserved pages
05-05 00:39:01.453 <4>0[14336.608653] 989 slab pages
05-05 00:39:01.453 <4>0[14336.608679] 1161 pages shared
05-05 00:39:01.453 <4>0[14336.608706] 175 pages swap cached
05-05 00:39:01.453 <6>0[14336.608737] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
05-05 00:39:01.453 <6>0[14336.608812] [   65]     0    65       78       26   0     -16          -941 ueventd
05-05 00:39:01.453 <6>0[14336.608880] [   80]  1000    80      207        8   0     -16          -941 servicemanager
05-05 00:39:01.453 <6>0[14336.608949] [   81]     0    81     1005       40   0     -16          -941 vold
05-05 00:39:01.453 <6>0[14336.609012] [   82]  1000    82       62        6   0     -16          -941 vcharged
05-05 00:39:01.453 <6>0[14336.609077] [   83]  1000    83      170        4   0     -16          -941 sprd_monitor
05-05 00:39:01.453 <6>0[14336.609145] [   87]  1001    87      205        8   0     -16          -941 rilproxy
05-05 00:39:01.453 <6>0[14336.609210] [   88]     0    88     1855       40   0     -16          -941 netd
05-05 00:39:01.453 <6>0[14336.609273] [   89]     0    89      173        5   0     -16          -941 debuggerd
05-05 00:39:01.453 <6>0[14336.609339] [   90]  1013    90     6471       98   0     -16          -941 mediaserver
05-05 00:39:01.453 <6>0[14336.609406] [   91]  1002    91      341       22   0     -16          -941 dbus-daemon
05-05 00:39:01.453 <6>0[14336.609473] [   92]  1017    92      433        5   0     -16          -941 keystore
05-05 00:39:01.453 <6>0[14336.609538] [   93]     0    93      319       11   0     -16          -941 nvm_daemon
05-05 00:39:01.453 <6>0[14336.609604] [   94]  1000    94      461        5   0     -16          -941 modemd
05-05 00:39:01.453 <6>0[14336.609668] [   98]  1001    98      205        7   0     -16          -941 rilproxy
05-05 00:39:01.453 <6>0[14336.609733] [  100]     0   100      278        5   0     -16          -941 engmoded
05-05 00:39:01.454  0     -16          -941 adbd
05-05 00:39:01.454 <6>0[14336.609861] [  158]  1000   158     7179       18   0     -16          -941 phoneserver_2si
05-05 00:39:01.454 <6>0[14336.609931] [  162]  1000   162     2102       19   0     -16          -941 engpcclient
05-05 00:39:01.454 <6>0[14336.610149] [  163]  1000   163      468       21   0     -16          -941 engmodemclient
05-05 00:39:01.454 <6>0[14336.610227] [  165]  1000   165      464       19   0     -16          -941 engservice
05-05 00:39:01.454 <6>0[14336.610294] [  322]     0   322     1320       63   0     -16          -941 slog
05-05 00:39:01.454 <6>0[14336.610358] [  569]  1001   569     2419       31   0     -16          -941 rild_sp
05-05 00:39:01.454 <6>0[14336.610424] [  570]  1001   570     1651       26   0     -16          -941 rild_sp
05-05 00:39:01.454 <6>0[14336.610488] [  844]     0   844    53199    13717   0       0             0 b2g
05-05 00:39:01.454 <6>0[14336.610550] [  859]     0   859    12671      106   0     -16          -941 (Nuwa)
05-05 00:39:01.454 <6>0[14336.610615] [ 4265]  1010  4265      627       33   0     -16          -941 wpa_supplicant
05-05 00:39:01.454 <6>0[14336.610685] [15895]     0 15895      189       93   0       0             0 sh
05-05 00:39:01.454 <6>0[14336.610746] [15897]     0 15897      172       69   0       0             0 orng
05-05 00:39:01.454 <4>0[14336.610810] zram0 status unit(page):
05-05 00:39:01.454 <4>0[14336.610816]   mem_used_total:  5665
05-05 00:39:01.454 <4>0[14336.610822]   compr_data_size: 5470
05-05 00:39:01.454 <4>0[14336.610827]   orig_data_size:  16006
05-05 00:39:01.454 <4>0[14336.610833]   num_reads:       1849266
05-05 00:39:01.454 <4>0[14336.610839]   num_writes:      1936936

dannyliang@dannyliang-ubuntu-pc:~/work/tarako/0505_b2g_was_killed_by_lmk$ grep "lowmem_shrink send sigkill" ./kernel.log -nr | grep b2g -A5 -B5
50199:05-04 23:07:48.424 <4>0[ 8863.571512] lowmem_shrink send sigkill to 12831 (Homescreen), adj 8, size 6240
52335:05-04 23:13:16.041 <4>0[ 9191.191151] lowmem_shrink send sigkill to 12984 (Camera), adj 10, size 5396
53655:05-04 23:24:19.882 <4>0[ 9855.034193] lowmem_shrink send sigkill to 13004 (Homescreen), adj 8, size 5402
58445:05-05 00:08:50.722 <4>0[12525.873967] lowmem_shrink send sigkill to 13238 (Communications), adj 1, size 5536
60862:05-05 00:31:53.692 <4>0[13908.843384] lowmem_shrink send sigkill to 13249 ((Preallocated a), adj 1, size 3027
61713:05-05 00:39:01.453 <4>0[14336.605512] lowmem_shrink send sigkill to 844 (b2g), adj 0, size 27064
61919:05-05 00:39:48.974 <4>0[14384.126738] lowmem_shrink send sigkill to 16041 (Usage), adj 11, size 5516
62445:05-05 00:42:46.467 <4>0[14561.619169] lowmem_shrink send sigkill to 16138 (Usage), adj 10, size 4469
62706:05-05 00:44:11.252 <4>0[14646.401318] lowmem_shrink send sigkill to 16070 (Homescreen), adj 8, size 7680
62899:05-05 00:45:01.706 <4>0[14696.858448] lowmem_shrink send sigkill to 16460 (Video), adj 10, size 7202
63040:05-05 00:45:32.211 <4>0[14727.360789] lowmem_shrink send sigkill to 16546 (Homescreen), adj 8, size 6086

Actual results:
B2g was killed by LMK in tarako

Expected results:
No b2g was killed in tarako
dannyliang@dannyliang-PORTEGE-R930:/usr/share/ca-certificates$ adb shell b2g-info
                         |     megabytes     |
           NAME PID PPID CPU(s) NICE  USS  PSS  RSS SWAP VSIZE OOM_ADJ USER   
            b2g  86    1   24.9    0 34.4 38.5 46.9  5.1 135.6       0 root   
         (Nuwa) 338   86    1.1    0  1.9  4.8 11.9  4.1  49.4       0 root   
     Homescreen 378  338    3.6    1 12.5 16.9 26.4  2.1  80.7       2 app_378
(Preallocated a 507  338    0.8   18  5.4  8.6 16.6  2.2  58.4       1 root
Pre-discussed with Ting-Yuan, we'd like to know the pre-process swap information when issue happened.

Hi James, could you apply the attached patch for b2g-info and get b2g-info every 5 seconds in monkey test?
Flags: needinfo?(james.zhang)
The key logs are at below:

58445:05-05 00:08:50.722 <4>0[12525.873967] lowmem_shrink send sigkill to 13238 (Communications), adj 1, size 5536
05-05 00:08:50.723 <4>0[12525.879488] zram0 status unit(page):
05-05 00:08:50.723 <4>0[12525.879494]  	mem_used_total:  6034 
05-05 00:08:50.723 <4>0[12525.879499]  	compr_data_size: 5850 
05-05 00:08:50.723 <4>0[12525.879505]  	orig_data_size:  15800 
05-05 00:08:50.723 <4>0[12525.879511]  	num_reads:       1849111 
05-05 00:08:50.723 <4>0[12525.879517]  	num_writes:      1932943

60862:05-05 00:31:53.692 <4>0[13908.843384] lowmem_shrink send sigkill to 13249 ((Preallocated a), adj 1, size 3027
05-05 00:31:53.693 <4>0[13908.848650] zram0 status unit(page):
05-05 00:31:53.693 <4>0[13908.848656]  	mem_used_total:  5709 
05-05 00:31:53.693 <4>0[13908.848662]  	compr_data_size: 5518 
05-05 00:31:53.693 <4>0[13908.848668]  	orig_data_size:  15853 
05-05 00:31:53.693 <4>0[13908.848673]  	num_reads:       1849252 
05-05 00:31:53.693 <4>0[13908.848679]  	num_writes:      1936419 

61713:05-05 00:39:01.453 <4>0[14336.605512] lowmem_shrink send sigkill to 844 (b2g), adj 0, size 27064
05-05 00:39:01.454 <4>0[14336.610810] zram0 status unit(page):
05-05 00:39:01.454 <4>0[14336.610816]  	mem_used_total:  5665 
05-05 00:39:01.454 <4>0[14336.610822]  	compr_data_size: 5470 
05-05 00:39:01.454 <4>0[14336.610827]  	orig_data_size:  16006 
05-05 00:39:01.454 <4>0[14336.610833]  	num_reads:       1849266 
05-05 00:39:01.454 <4>0[14336.610839]  	num_writes:      1936936
Ying has reverted "si.freeswap" patch.


commit 2c8ccc9f0a4d1b3a6f2cf0ad10eccdea6c574bed
Author: ying.xu <ying.xu@spreadtrum.com>
Date:   Mon May 5 19:03:55 2014 +0800

    Bug#303502 revert trigger memory pressure notify when free memory of swap low
    
    [self test   ]  OK ,mozilla bug 1005799
    
    Change-Id: Ia3d04590eb8fe3f804a38c05e83df9ac3c21eb94

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index f29791b..beda389 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -222,9 +222,8 @@ static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
        }
 
         if (sc->nr_to_scan > 0 &&
-                        ((other_free < lowmem_minfree_notif_trigger &&
-                        other_file < lowmem_minfree_notif_trigger) ||
-                       (si.freeswap << (PAGE_SHIFT - 10)) < lowmem_minfree_notif_trigger)) {
+                        (other_free < lowmem_minfree_notif_trigger &&
+                        other_file < lowmem_minfree_notif_trigger)) {
                 lowmem_notify_killzone_approach();
         }
Flags: needinfo?(james.zhang)
(In reply to James Zhang from comment #4)
> Ying has reverted "si.freeswap" patch.
> 
> 
> commit 2c8ccc9f0a4d1b3a6f2cf0ad10eccdea6c574bed
> Author: ying.xu <ying.xu@spreadtrum.com>
> Date:   Mon May 5 19:03:55 2014 +0800
> 
>     Bug#303502 revert trigger memory pressure notify when free memory of
> swap low
>     
>     [self test   ]  OK ,mozilla bug 1005799
>     
>     Change-Id: Ia3d04590eb8fe3f804a38c05e83df9ac3c21eb94
> 
> diff --git a/drivers/staging/android/lowmemorykiller.c
> b/drivers/staging/android/lowmemorykiller.c
> index f29791b..beda389 100644
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -222,9 +222,8 @@ static int lowmem_shrink(struct shrinker *s, struct
> shrink_control *sc)
>         }
>  
>          if (sc->nr_to_scan > 0 &&
> -                        ((other_free < lowmem_minfree_notif_trigger &&
> -                        other_file < lowmem_minfree_notif_trigger) ||
> -                       (si.freeswap << (PAGE_SHIFT - 10)) <
> lowmem_minfree_notif_trigger)) {
> +                        (other_free < lowmem_minfree_notif_trigger &&
> +                        other_file < lowmem_minfree_notif_trigger)) {
>                  lowmem_notify_killzone_approach();
>          }

Is this the root cause of b2g was killed by LMK?
If we don't trigger memory pressure event when low swapfree, system will be lag at that time.
Flags: needinfo?(ying.xu)
Flags: needinfo?(james.zhang)
Android LMK don't have these code, so I don't it won't cause lag.
Flags: needinfo?(james.zhang)
Android LMK don't have these code, so I don't think it will cause lag.
The root cause is b2g consuming too much memory.
But there is a bug in LMK. See code below.

the check condition when lowmem_notify_killzone_approach was different from lowmem_notify_trigger_active_show
If free swap was lower than lowmem_minfree_notif_trigger, lowmem_notify_killzone_approach  checked.
But lowmem_notify_trigger_active_show not,which means gecko was not notified a memory-pressure event when free swap was low.
So I revert the change.

        if (sc->nr_to_scan > 0 &&
                        ((other_free < lowmem_minfree_notif_trigger &&
                        other_file < lowmem_minfree_notif_trigger) ||
                        (si.freeswap << (PAGE_SHIFT - 10)) < lowmem_minfree_notif_trigger)) {  //here
                lowmem_notify_killzone_approach();
        }

static ssize_t lowmem_notify_trigger_active_show(struct kobject *k,
                struct kobj_attribute *attr, char *buf)
{
        int other_free, other_file;
        lowmem_print(5, "lowmem_shrink notify_trigger_active_show\n");
        get_free_ram(&other_free, &other_file);
        if (other_free < lowmem_minfree_notif_trigger &&
                        other_file < lowmem_minfree_notif_trigger)  //here
                return snprintf(buf, 3, "1\n");
        else
                return snprintf(buf, 3, "0\n");
}
Flags: needinfo?(ying.xu)
b2g total memory rss+swap 27064
05-05 00:39:01.453 <4>0[14336.605437] lowmem_shrink select 844 (b2g), adj 0, size 27064, to kill
05-05 00:39:01.453 <4>0[14336.605512] lowmem_shrink send sigkill to 844 (b2g), adj 0, size 27064

rss only 13717 about 55M
05-05 00:39:01.454 <6>0[14336.610488] [  844]     0   844    53199    13717   0       0             0 b2g

swap
27064 - 13717 = 13347 pages

the zram compress ratio was about 3:1, which mean real memory utilization of b2g swap was about 17M
05-05 00:39:01.454 <4>0[14336.610822]   compr_data_size: 5470
05-05 00:39:01.454 <4>0[14336.610827]   orig_data_size:  16006

total b2g memory utilization was about 72M
(In reply to Danny Liang [:dliang] from comment #0)
> 05-05 00:39:01.453 <4>0[14336.606612] Normal free:2748kB min:1352kB
> low:2200kB high:2540kB active_anon:27528kB inactive_anon:27652kB

active_anon + inactive_anon = 55MB.  But, the total of RSS of all processes is only 14.3MB.  So, how does the anonymous memory comes so big?
(In reply to Thinker Li [:sinker] from comment #10)
 But, the total of RSS of all processes
> is only 14.3MB.  So, how does the anonymous memory comes so big?

I think 14.3 MB was wrong for the total of RSS of all processes
Can you tell us the way you calculate this?
You could calculate sum of RSS from the log of comment 0.
(In reply to Thinker Li [:sinker] from comment #12)
> You could calculate sum of RSS from the log of comment 0.

05-05 00:39:01.453 <6>0[14336.608737] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
the unit of rss is page, not KB
(In reply to ying.xu from comment #13)
> (In reply to Thinker Li [:sinker] from comment #12)
> > You could calculate sum of RSS from the log of comment 0.
> 
> 05-05 00:39:01.453 <6>0[14336.608737] [ pid ]   uid  tgid total_vm      rss
> cpu oom_adj oom_score_adj name
> the unit of rss is page, not KB
Ok! Make sense.
After the discussion with Ting-Yaun, 212MB of b2g process seems too big.  Could we run a monkey test on a device with 256MB of DRAM to figure out if it still happen and run get_about_memory.sh?
Pre-discussed w/ Ting-Yuan, we'd like provide a way to get about memory when issue happened, the idea as below:
1. When LMK choose b2g process to be a victim, killed orng (process of monkey test) instead of b2g.
2. Add some check in script of monkey test, if the orng was killed, stop monkey test and run get_about_memory to get memory information we want.
I have a  STR with 30% possibilities to reproduce this bug.

make a call, during the first call, make another call

or receive a call , during the first call, receive another call
(In reply to ying.xu from comment #17)
> I have a  STR with 30% possibilities to reproduce this bug.
> 
> make a call, during the first call, make another call
> 
> or receive a call , during the first call, receive another call

Hi Ying, I cannot reproduce on my side in both MT/MO call by your steps. If the reproduce rate is 30% with your STR, could you help to reproduce it and try to get the report of about memory?
When you reproduce this issue, please apply my kernel patch and it can avoid LMK to kill b2g, then you can use below command to get memory report.

How to get about memory:
cd tools
MOZ_IGNORE_NUWA_PROCESS=1 ./get_about_memory.py --no-gc-cc-log
Attached file tools.tar.gz
tools folder included ./get_about_memory
Flags: needinfo?(ying.xu)
try many times with nightly build. not happen yet

But we do have some bugs .Please refer

http://bugzilla.spreadtrum.com/bugzilla/show_bug.cgi?id=306280
Flags: needinfo?(ying.xu)
(In reply to ying.xu from comment #21)
> try many times with nightly build. not happen yet
> 
> But we do have some bugs .Please refer
> 
> http://bugzilla.spreadtrum.com/bugzilla/show_bug.cgi?id=306280

I have checked the log on ftp://10.0.0.67/FromPlatformSW/PLDSQA/FirefoxOS/TestLog/6821/0426/会议电话定屏, but I didn't find b2g was killed by LMK. Could you share the log of \\shextnas1\spdlogs\Logs\Firefox\wen\0505\三方通话定屏 on bugzilla? Thanks,
Flags: needinfo?(ying.xu)
(In reply to Danny Liang [:dliang] from comment #22)

> but I didn't find b2g was killed by LMK. Could you share the log of
> \\shextnas1\spdlogs\Logs\Firefox\wen\0505\三方通话定屏 on bugzilla? Thanks,

The log seem not very useful and huge(55M).

You can refer the duplicated bugs, we have added some preliminary analysis.
http://bugzilla.spreadtrum.com/bugzilla/show_bug.cgi?id=308134
http://bugzilla.spreadtrum.com/bugzilla/show_bug.cgi?id=307045
Flags: needinfo?(ying.xu)
(In reply to Danny Liang [:dliang] from comment #22)

> \\shextnas1\spdlogs\Logs\Firefox\wen\0505\三方通话定屏 on bugzilla? Thanks,

ftp://pdtest@ftp.spreadtrum.com/PDLogs/FFOS/2014/FIREFOXOS_V1.3_W14.19.1_SP6821/三方通话定屏.zip
This patch can trigger get_about_memory when b2g was be selected by LMK. When b2g was selected, we will kill orng to stop monkey test, stop LMK and set /sys/module/lowmemorykiller/parameters/b2g_was_victim to be 1. Monkey test script will monitor the value of b2g_was_victim, get_about_memory will be triggerred when b2g_was_victim is 1.

How to install it:
1. Apply the kernel patch 0507_kill_orng_instead_of_b2g.patch and rebuild boot.img
2. Put tools folder into the root of monkey test.
3. Replece the _test.sh in monkey test script.

Run monkey test w/ above settings, then you will get report of about memory when b2g was selected by LMK.
Attachment #8418610 - Flags: feedback?(ying.xu)
Attachment #8418610 - Flags: feedback?(james.zhang)
(In reply to Danny Liang [:dliang] from comment #25)
> Run monkey test w/ above settings, then you will get report of about memory
> when b2g was selected by LMK.

I don't think we can get the memory status at the point of LMK going to kill b2g
The script running on PC read memory-status every 60s.
Even LMK won't kill b2g, OOM will in a very short time. 60s is too long.
(In reply to ying.xu from comment #26)
> (In reply to Danny Liang [:dliang] from comment #25)
> > Run monkey test w/ above settings, then you will get report of about memory
> > when b2g was selected by LMK.
> 
> I don't think we can get the memory status at the point of LMK going to kill
> b2g
> The script running on PC read memory-status every 60s.
> Even LMK won't kill b2g, OOM will in a very short time. 60s is too long.

In these case, LMK was triggered due to lower other_file and lower sawp_free. If we stop LMK, will OOM be trigger due to other_file or swap_free is low?
If we cannot get memory status by current script, do you have any suggestion?
I think b2g process can register a signal handler to process signal 9
And in the handle function, dump all memory status.
I don't know the difficulty of work . It's only my suggestion
Ying will setup monkey test the same as you.
Attachment #8418610 - Flags: feedback?(james.zhang) → feedback+
Depends on: 1007520
(In reply to ying.xu from comment #17)
> I have a  STR with 30% possibilities to reproduce this bug.
> 
> make a call, during the first call, make another call
> 
> or receive a call , during the first call, receive another call

have more possibilities to reproduce this if using a bluetooth earphone at the first time
I have reproduced this bug once by establishing 2 way calls using a bluetooth earphone.

I found kernel was normal. app processes were normal.
The screen was stuck and no response to touchscreen and key input.

b2g took 95% CPU time by top command
using gdb to debug b2g, got some stacks
It seemed b2g was doing reflow in a endless loop(not sure about this)

#0  0x40fa2dce in ~nsRefPtr (this=0xbe9a523c, __in_chrg=<value optimized out>) at ../../dist/include/nsAutoPtr.h:886
#1  ~nsFont (this=0xbe9a523c, __in_chrg=<value optimized out>) at /home/ffos/yingxu/6821/gecko/gfx/src/nsFont.cpp:103
#2  0x41662748 in nsLayoutUtils::GetFontMetricsForStyleContext (aStyleContext=<value optimized out>, 
    aFontMetrics=0xbe9a52a4, aInflation=1) at /home/ffos/yingxu/6821/gecko/layout/base/nsLayoutUtils.cpp:2640
#3  0x416934da in ComputeLineHeight (aStyleContext=0x43871d50, aBlockHeight=<value optimized out>, 
    aFontSizeInflation=1) at /home/ffos/yingxu/6821/gecko/layout/generic/nsHTMLReflowState.cpp:2440
#4  nsHTMLReflowState::CalcLineHeight (aStyleContext=0x43871d50, aBlockHeight=<value optimized out>, 
    aFontSizeInflation=1) at /home/ffos/yingxu/6821/gecko/layout/generic/nsHTMLReflowState.cpp:2463
#5  0x416935ac in nsHTMLReflowState::CalcLineHeight (this=<value optimized out>)
    at /home/ffos/yingxu/6821/gecko/layout/generic/nsHTMLReflowState.cpp:2452
#6  0x4167852e in nsBlockReflowState (this=0xbe9a53f8, aReflowState=..., aPresContext=<value optimized out>, 
    aFrame=<value optimized out>, aTopMarginRoot=true, aBottomMarginRoot=true, aBlockNeedsFloatManager=true, 
    aConsumedHeight=0) at /home/ffos/yingxu/6821/gecko/layout/generic/nsBlockReflowState.cpp:109
#7  0x41676dea in nsBlockFrame::Reflow (this=0x470c1d40, aPresContext=0x46a54000, aMetrics=..., aReflowState=..., 
    aStatus=@0xbe9a57b0) at /home/ffos/yingxu/6821/gecko/layout/generic/nsBlockFrame.cpp:1017
#8  0x4167020e in nsAbsoluteContainingBlock::ReflowAbsoluteFrame (this=<value optimized out>, 
    aDelegatingFrame=0x46fa4750, aPresContext=0x46a54000, aReflowState=..., aContainingBlock=..., 
    aConstrainHeight=true, aKidFrame=0x470c1d40, aStatus=@0xbe9a57b0, aOverflowAreas=0xbe9a5b7c)
    at /home/ffos/yingxu/6821/gecko/layout/generic/nsAbsoluteContainingBlock.cpp:415
#9  0x416706b6 in nsAbsoluteContainingBlock::Reflow (this=<value optimized out>, aDelegatingFrame=0x46fa4750, 
    aPresContext=0x46a54000, aReflowState=..., aReflowStatus=@0xbe9a5b14, aContainingBlock=..., 
    aConstrainHeight=true, aCBWidthChanged=true, aCBHeightChanged=true, aOverflowAreas=0xbe9a5b7c)
    at /home/ffos/yingxu/6821/gecko/layout/generic/nsAbsoluteContainingBlock.cpp:137
#10 0x41683b94 in nsFrame::ReflowAbsoluteFrames (this=0x46fa4750, aPresContext=0x46a54000, 
    aDesiredSize=<value optimized out>, aReflowState=..., aStatus=@0xbe9a5b14, aConstrainHeight=true)
    at /home/ffos/yingxu/6821/gecko/layout/generic/nsFrame.cpp:4211
#11 0x416884e6 in nsFrame::FinishReflowWithAbsoluteFrames (this=0x46fa4750, aPresContext=0x0, aDesiredSize=..., 
    aReflowState=..., aStatus=@0xbe9a5b14, aConstrainHeight=<value optimized out>)
    at /home/ffos/yingxu/6821/gecko/layout/generic/nsFrame.cpp:4178
#12 0x4167a4e8 in nsCanvasFrame::Reflow (this=0x46fa4750, aPresContext=0x46a54000, aDesiredSize=..., 
    aReflowState=..., aStatus=@0xbe9a5b14) at /home/ffos/yingxu/6821/gecko/layout/generic/nsCanvasFrame.cpp:582
#13 0x4167ca04 in nsContainerFrame::ReflowChild (this=<value optimized out>, aKidFrame=0x46fa4750, 
    aPresContext=0x46a54000, aDesiredSize=..., aReflowState=..., aX=0, aY=0, aFlags=3, aStatus=@0xbe9a5b14, 
    aTracker=0x0) at /home/ffos/yingxu/6821/gecko/layout/generic/nsContainerFrame.cpp:962
#14 0x4168e866 in nsHTMLScrollFrame::ReflowScrolledFrame (this=0x46fa4928, aState=0xbe9a5c0c, 
    aAssumeHScroll=<value optimized out>, aAssumeVScroll=<value optimized out>, aMetrics=0xbe9a5b50, aFirstPass=true)
    at /home/ffos/yingxu/6821/gecko/layout/generic/nsGfxScrollFrame.cpp:459
#15 0x4168eb1c in nsHTMLScrollFrame::ReflowContents (this=0x46fa4928, aState=0xbe9a5c0c, 
    aDesiredSize=<value optimized out>) at /home/ffos/yingxu/6821/gecko/layout/generic/nsGfxScrollFrame.cpp:557
#16 0x4168ffa4 in nsHTMLScrollFrame::Reflow (this=0x46fa4928, aPresContext=<value optimized out>, aDesiredSize=..., 
    aReflowState=..., aStatus=@0xbe9a60e4) at /home/ffos/yingxu/6821/gecko/layout/generic/nsGfxScrollFrame.cpp:795
#17 0x4167ca04 in nsContainerFrame::ReflowChild (this=<value optimized out>, aKidFrame=0x46fa4928, 
    aPresContext=0x46a54000, aDesiredSize=..., aReflowState=..., aX=0, aY=0, aFlags=0, aStatus=@0xbe9a60e4, 
---Type <return> to continue, or q <return> to quit---
    aTracker=0x0) at /home/ffos/yingxu/6821/gecko/layout/generic/nsContainerFrame.cpp:962
#18 0x416b12f8 in ViewportFrame::Reflow (this=0x46fa4298, aPresContext=0x46a54000, aDesiredSize=..., 
    aReflowState=..., aStatus=@0xbe9a60e4) at /home/ffos/yingxu/6821/gecko/layout/generic/nsViewportFrame.cpp:222
#19 0x41627bca in PresShell::DoReflow (this=0x46a19a60, target=0x0, aInterruptible=<value optimized out>)
    at /home/ffos/yingxu/6821/gecko/layout/base/nsPresShell.cpp:8148
#20 0x4162b47c in PresShell::ProcessReflowCommands (this=0x46a19a60, aInterruptible=false)
    at /home/ffos/yingxu/6821/gecko/layout/base/nsPresShell.cpp:8304
#21 0x4162c0d6 in PresShell::FlushPendingNotifications (this=0x46a19a60, aFlush=...)
    at /home/ffos/yingxu/6821/gecko/layout/base/nsPresShell.cpp:4055
#22 0x41625a96 in PresShell::FlushPendingNotifications (this=<value optimized out>, aType=<value optimized out>)
    at /home/ffos/yingxu/6821/gecko/layout/base/nsPresShell.cpp:3901
#23 0x413c09d6 in nsDocument::FlushPendingNotifications (this=0x48433800, aType=Flush_Layout)
    at /home/ffos/yingxu/6821/gecko/content/base/src/nsDocument.cpp:7199
#24 0x413e1bf6 in mozilla::dom::Element::GetPrimaryFrame (this=0x47d307e0, aType=Flush_Content)
    at /home/ffos/yingxu/6821/gecko/content/base/src/Element.cpp:1579
#25 0x413e3a4a in mozilla::dom::Element::GetBoundingClientRect (this=0x47d307e0)
    at /home/ffos/yingxu/6821/gecko/content/base/src/Element.cpp:653
#26 0x410aa60c in getBoundingClientRect (cx=0x42d82320, obj=..., self=0x47d307e0, args=...)
    at /home/ffos/yingxu/6821/objdir-gecko/dom/bindings/ElementBinding.cpp:1197
#27 0x43d11404 in ?? ()
#28 0x43d11404 in ?? ()


#0  memmove () at bionic/libc/arch-arm/bionic/memmove.S:253
#1  0x416633a0 in nsLayoutUtils::ChangeMatrixBasis (aOrigin=..., aMatrix=<value optimized out>)
    at /home/ffos/yingxu/6821/gecko/layout/base/nsLayoutUtils.cpp:1614
#2  0x41657082 in nsDisplayTransform::GetResultingTransformMatrixInternal (aProperties=..., aOrigin=..., 
    aAppUnitsPerPixel=60, aBoundsOverride=0x0, aOutAncestor=0xbe9a5e3c)
    at /home/ffos/yingxu/6821/gecko/layout/base/nsDisplayList.cpp:4256
#3  0x416570cc in nsDisplayTransform::GetResultingTransformMatrix (aFrame=<value optimized out>, aOrigin=..., 
    aAppUnitsPerPixel=60, aBoundsOverride=0x0, aOutAncestor=0xbe9a5e3c)
    at /home/ffos/yingxu/6821/gecko/layout/base/nsDisplayList.cpp:4163
#4  0x41685eb6 in nsIFrame::GetTransformMatrix (this=0x47990828, aStopAtAncestor=0x456f8298, aOutAncestor=0xbe9a5e3c)
    at /home/ffos/yingxu/6821/gecko/layout/generic/nsFrame.cpp:4590
#5  0x41662b10 in nsLayoutUtils::GetTransformToAncestor (aFrame=<value optimized out>, aAncestor=0x456f8298)
    at /home/ffos/yingxu/6821/gecko/layout/base/nsLayoutUtils.cpp:1789
#6  0x4163547c in PredictScaleForContent (aFrame=<value optimized out>, aAncestorWithScale=0x456f8298, 
    aScale=<value optimized out>) at /home/ffos/yingxu/6821/gecko/layout/base/FrameLayerBuilder.cpp:3325
#7  0x416355b0 in mozilla::FrameLayerBuilder::GetThebesLayerScaleForFrame (aFrame=0x42d47890)
    at /home/ffos/yingxu/6821/gecko/layout/base/FrameLayerBuilder.cpp:3373
#8  0x41690fc2 in mozilla::ScrollFrameHelper::ScrollToImpl (this=0x42d47730, aPt=..., aRange=..., aOrigin=0x43823360)
    at /home/ffos/yingxu/6821/gecko/layout/generic/nsGfxScrollFrame.cpp:2009
#9  0x41691b4c in mozilla::ScrollFrameHelper::ReflowFinished (this=0x42d47730)
    at /home/ffos/yingxu/6821/gecko/layout/generic/nsGfxScrollFrame.cpp:3787
#10 0x4162af20 in PresShell::HandlePostedReflowCallbacks (this=0x46a19a60, aInterruptible=false)
    at /home/ffos/yingxu/6821/gecko/layout/base/nsPresShell.cpp:3860
#11 0x4162af6a in PresShell::DidDoReflow (this=0x46a19a60, aInterruptible=false, aWasInterrupted=false)
    at /home/ffos/yingxu/6821/gecko/layout/base/nsPresShell.cpp:7960
#12 0x4162b4c6 in PresShell::ProcessReflowCommands (this=0x46a19a60, aInterruptible=false)
    at /home/ffos/yingxu/6821/gecko/layout/base/nsPresShell.cpp:8316
#13 0x4162c0d6 in PresShell::FlushPendingNotifications (this=0x46a19a60, aFlush=...)
    at /home/ffos/yingxu/6821/gecko/layout/base/nsPresShell.cpp:4055
#14 0x41625a96 in PresShell::FlushPendingNotifications (this=<value optimized out>, aType=<value optimized out>)
    at /home/ffos/yingxu/6821/gecko/layout/base/nsPresShell.cpp:3901
#15 0x413c09d6 in nsDocument::FlushPendingNotifications (this=0x48433800, aType=Flush_Layout)
    at /home/ffos/yingxu/6821/gecko/content/base/src/nsDocument.cpp:7199
#16 0x413e1bf6 in mozilla::dom::Element::GetPrimaryFrame (this=0x47d307e0, aType=Flush_Content)
    at /home/ffos/yingxu/6821/gecko/content/base/src/Element.cpp:1579
#17 0x413e3a4a in mozilla::dom::Element::GetBoundingClientRect (this=0x47d307e0)
    at /home/ffos/yingxu/6821/gecko/content/base/src/Element.cpp:653
#18 0x410aa60c in getBoundingClientRect (cx=0x42d82320, obj=..., self=0x47d307e0, args=...)
    at /home/ffos/yingxu/6821/objdir-gecko/dom/bindings/ElementBinding.cpp:1197
After the screen was stuck, b2g still consumed memory. And get killed at the end.
Update the test result w/ patches 
3 devices all met b2g was killed, and b2g in two devices are restarted.
No OOM in 3 devices even if we stop LMK when b2g was selected to be killed.
One device keeps the environment and memory status as below:

dannyliang@dannyliang-ubuntu-pc:~/work/tarako/monkey.test/tools$ adb shell b2g-info
                |      megabytes     |
  NAME PID PPID  CPU(s) NICE  USS  PSS  RSS SWAP VSIZE OOM_ADJ USER
   b2g  85    1 41280.1    0 40.2 40.3 40.5 48.9 192.7       0 root
(Nuwa) 336   85   312.9    0  0.0  0.1  0.2  8.6  45.8       0 root

I cannot get memory report by ./get_about_memory in this case, the error message as below:
dannyliang@dannyliang-ubuntu-pc:~/work/tarako/monkey.test/tools$ MOZ_IGNORE_NUWA_PROCESS=1 ./get_about_memory.py --no-gc-cc-log
Got 0/1 files.
We've waited 120s but the only relevant files we see are

We expected 1 but see only 0 files.  Giving up...
Traceback (most recent call last):
  File "./get_about_memory.py", line 294, in <module>
    main()
  File "./get_about_memory.py", line 291, in main
    get_and_show_info(args)
  File "./get_about_memory.py", line 184, in get_and_show_info
    (out_dir, merged_reports_path, dmd_files) = get_dumps(args)
  File "./get_about_memory.py", line 181, in get_dumps
    return utils.run_and_delete_dir_on_exception(do_work, out_dir)
  File "/home/dannyliang/work/tarako/monkey.test/tools/include/device_utils.py", line 147, in run_and_delete_dir_on_exception
    return fun()
  File "./get_about_memory.py", line 164, in do_work
    optional_outfiles_prefixes=['dmd-'])
  File "/home/dannyliang/work/tarako/monkey.test/tools/include/device_utils.py", line 266, in notify_and_pull_files
    raise Exception("Unable to pull some files.")
Exception: Unable to pull some files.
I tried to have monkey test on master, the result showed 4 devices all met crash and b2g was restarted.
(In reply to Danny Liang [:dliang] from comment #34)
> I tried to have monkey test on master, the result showed 4 devices all met
> crash and b2g was restarted.

Moreover, all devices also met b2g was killed by LMK due to low swap free.
Update more detail of monkey test on master:
DUT1: b2g was killed by LMK at 4.8hrs after test start
DUT2: b2g was killed by LMK at 4.9hrs after test start
DUT3: b2g was killed by LMK at 4.2hrs after test start
DUT4: b2g was killed by LMK at 5.1hrs after test start
Do you ever run monkey test on devices with bigger memory configuration, for example, unagi?  I guess the fail of get_about_memory is caused by running out of memory for the action (get_about_memory).  I suggest to try it on devices with bigger memory configuration, and wait for b2g process to grow to 19xMB of VSIZE. Then, try to run get_about_memory.
Attachment #8418610 - Flags: feedback?(ying.xu)
very low possibilities now

the situation of killing b2g is something like below
b2g took at least 70M memory

01-11 19:35:21.835 <3>0[85883.792598] oom-killer use lmk strategy, send sigkill to 617 (b2g), adj 0, size 18860
01-11 19:35:21.835 <4>0[85883.792624] Compositor invoked oom-killer: gfp_mask=0x4286d2, order=0, oom_adj=0, oom_score_adj=0

01-14 13:27:33.834 <3>0[69355.764161] oom-killer use lmk strategy, send sigkill to 651 (b2g), adj 0, size 18369
01-14 13:27:33.834 <4>0[69355.764186] Compositor invoked oom-killer: gfp_mask=0x4286d2, order=0, oom_adj=0, oom_score_adj=0

02-13 03:46:49.037 <3>0[36571.997641] oom-killer use lmk strategy, send sigkill to 601 (b2g), adj 0, size 17738
02-13 03:46:49.037 <4>0[36571.997668] Compositor invoked oom-killer: gfp_mask=0x4286d2, order=0, oom_adj=0,
Firefox OS is not being worked on
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.