Closed Bug 1081577 Opened 5 years ago Closed 4 years ago

[Performance][Dialer] lmk.js is wrong in Flame base image

Categories

(Firefox OS Graveyard :: Vendcom, defect, P2)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:2.2+, firefox44 unaffected, b2g-v2.0 unaffected, b2g-v2.1 affected, b2g-v2.2 affected, b2g-master unaffected)

RESOLVED FIXED
blocking-b2g 2.2+
Tracking Status
firefox44 --- unaffected
b2g-v2.0 --- unaffected
b2g-v2.1 --- affected
b2g-v2.2 --- affected
b2g-master --- unaffected

People

(Reporter: Marty, Assigned: cyu)


Details

(Keywords: regression, Whiteboard: [2.1-exploratory-3][POVB])

Attachments

(4 files)

Attached file logcat-Call-Delay.txt
Description:
If the user has several apps open (4-5), when they receive a phone call the ringtone will play properly, but the incoming call UI takes up to 5 seconds to appear and allow the user to accept or deny the call.

This issue seems to be more severe when the user is viewing an app in landscape mode at the time of the call.
   
Repro Steps:
1) Update a Flame device to BuildID: 20141011000201
2) Open several apps (Browser, Gallery, Messaging, Contacts, Settings, Marketplace)
3) View the Browser app in landscape mode.
4) Call the DUT from another phone.
  
Actual:
Call UI appeared 5 seconds after ringtone began playing.

Expected: 
Call UI appears immediately when the ringtone begins playing.
  
Environmental Variables:
Device: Flame 2.1 (319MB)
BuildID: 20141011000201 (Full Flash)
Gaia: f5d4ff60ffed8961f7d0380ada9d0facfdfd56b1
Gecko: d813d79d3eae
Gonk: 52c909e821d107d414f851e267dedcd7aae2cebf
Version: 34.0a2 (2.1)
Firmware: V180
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0
  
Notes:
  
Repro frequency: 5/5
See attached: video clip (URL), logcat

------------------

This issue DOES occur on Flame 2.2
Call UI takes 5 seconds to appear after the ringtone begins playing when there are multiple apps open.

Environmental Variables:
Device: Flame 2.2 Master (319MB)
BuildID: 20141011040204 (Full Flash)
Gaia: 95f580a1522ffd0f09302372b78200dab9b6f322
Gecko: 3f6a51950eb5
Gonk: 52c909e821d107d414f851e267dedcd7aae2cebf
Version: 35.0a1 (2.2 Master)
Firmware: V180
User Agent: Mozilla/5.0 (Mobile; rv:35.0) Gecko/35.0 Firefox/35.0
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(dharris)
Whiteboard: [2.1-Daily-Testing] → [2.1-exploratory-3]
[Blocking Requested - why for this release]:

That video is bad.

Marty, how bad is the performance when it's in portrait mode? Having an unactionable delay this long on an incoming call is unacceptable to a user.

Also, does this reproduce on the raw base image v180? (which is 2.0)

flagging for blocking and necessary investigation.
blocking-b2g: --- → 2.1?
Keywords: qawanted
I was able to repro this issue on the reporter's 2.2 build - out of 20 trials the average delay seemed to be 2 or 3 seconds, but there were several instances of 4 and 5 second delays. 

This issue DOES NOT repro with 512 MB of memory.
------------------------------------------------------------------------------------------

The following testing was done with 319 MB of memory:

This issue DOES reproduce on the raw base image v180, but DOES NOT repro with 2.0 Full Flashed on top

Actual Results - Opening all the apps listed in the STR and calling the DUT resulted in the callscreen appearing in 4-5 seconds on average on Base only.

Device: Flame 2.0 (Base Only) Repro
Build ID: 20140904160718
Gaia: 506da297098326c671523707caae6eaba7e718da
Gecko: 2b27becae85092d46bfadcd4fb5605e82e1e1093
Version: 32.0 (2.0)
Firmware Version: V180
User Agent: Mozilla/5.0 (Mobile; rv:32.0) Gecko/32.0 Firefox/32.0

Device: Flame 2.0 - No Repro
Build ID: 20141012000202
Gaia: 6effca669c5baaf6cd7a63c91b71a02c6bd953b3
Gecko: 54ec9cb26b59
Version: 32.0 (2.0)
Firmware Version: V180
User Agent: Mozilla/5.0 (Mobile; rv:32.0) Gecko/32.0 Firefox/32.0
Flags: needinfo?(dharris)
Regression window unavailable - 

This issue occurs in the oldest 2.1 build we have access to

Device: Flame 2.1
Build ID: 20140904062538
Gaia: a47ecb6368c015dd72148acde26413fd90ba3136
Gecko: ffb144a500a4
Version: 34.0a2 
Firmware Version: V180
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0

and the issue does not reproduce on JB

Device: Flame Master
Build ID: 20141003070740
Gaia: a8a6eed2ba9d66239aac789b9ee4900f911c73cb
Gecko: 388e101e75c8
Version: 35.0a1 (Master)
Firmware Version: V123
User Agent: Mozilla/5.0 (Mobile; rv:35.0) Gecko/35.0 Firefox/35.0
Flags: needinfo?(pbylenga)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(pbylenga)
Regression in call UI appearance = blocker.

This should probably be moved out of the Performance component to get some attention.  Gregor, is Systems FE the right component?
blocking-b2g: 2.1? → 2.1+
Flags: needinfo?(anygregor)
That sounds like window-management to me.
Component: Performance → Gaia::System::Window Mgmt
Flags: needinfo?(anygregor)
Can we do a regression window check on the 2.0 between raw base image v180 and full flash?
This sounds to me like we drop the callscreen on memory pressure, but in some Gaia versions (see comment 2) the kill/reload is failing.

Etienne, any thoughts?
Flags: needinfo?(etienne)
(In reply to Alive Kuo [:alive][NEEDINFO!] from comment #7)
> This sounds to me like we drop the callscreen on memory pressure, but in
> some Gaia versions (see comment 2) the kill/reload is failing.
> 
> Etienne, any thoughts?

I don't think it's failing since the callscreen eventually comes up.
Looks like this is just bug 999478 working.

After a memory-pressure event we don't reload the callscreen until the next call.
And I don't think we get another event once the memory pressure is over, so we don't have a good heuristic for triggering a new preload of the callscreen.
Flags: needinfo?(etienne)
(In reply to howie [:howie] from comment #6)
> Can we do a regression window check on the 2.0 between raw base image v180
> and full flash?

If I'm understanding your question correctly, the answer is no. You (or we) cannot get a regression window between a base image and a branch of builds; regression windows have to be found within a branch itself, AFAIK.

If there IS a way to do this, we lack documentation / proper pushlog links to use / etc.
Hi Alive, so we still need your help to dig in more on this.
Flags: needinfo?(alive)
Very strange... it seems we NEVER kill apps even when there are 10 or more apps running in the 319 MB 2.1 build.
So we have more than 10 apps running at the same time, and I think that is the root cause of this bug.

Cervantes, any idea?
Flags: needinfo?(alive) → needinfo?(cyu)
I wrote up in bug 1080239 that apps are not being killed, and it was resolved as invalid.
This is more of a device- or configuration-dependent issue. Currently, the LMK and zram configuration is not adjusted at boot according to the device's configuration; HW vendors need to tune LMK/zram for their own HW/SW configuration to get better performance. I don't think this is limited to 319 MB: my dogfood Flame with 1 GB also hits the same problem if I open enough processes. In particular, we never close web pages, so if you follow links in the FB app, over time you will accumulate a bunch of processes for the visited links, and the device will slow down too.

Cervantes, please help revise our LMK and OOM settings.
Assignee: nobody → cyu
(In reply to KTucker [:KTucker] from comment #12)
> I wrote up in bug 1080239 that apps are not being killed, and it was
> resolved as invalid.

I am sorry if I misunderstood something. We have a feature that keeps an OOM-killed app in the background without really keeping the application alive, and from the bug comments we couldn't tell whether you knew about this feature, so we thought that was what you were seeing. What we want to see is the |adb shell b2g-info| output to confirm that all apps are alive.
I reproduced this problem on the 2014-10-11-00-12-01 build. It requires several rounds of the steps in comment 0 to see this problem. b2g-info shows that we are using the LMK parameters for a low-memory device. I checked dmesg, and no process is killed after this STR. That is, the kernel works very hard to keep everything alive, but this is not what we expect.

We can:
1. Increase the LMK parameters, but I am suspicious about this: even if we double the parameters, free+cache is still larger than the minfree for background apps.
2. Lower swappiness to make the kernel less likely to swap memory.

I guess we need to do both, and we will need more experiments to verify.
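The reasoning behind option 1 can be checked with quick shell arithmetic. A minimal sketch, using illustrative figures in KB (the variable names and the free+cache value are mine, not from the bug; the 10240 KB background threshold is the one reported later in this thread):

```shell
# Sketch of option 1's reasoning: the LMK only kills background apps once
# free+cache drops below their minfree threshold, so doubling a threshold
# that is already far below typical free+cache changes nothing.
free_plus_cache_kb=28160   # ~27.5 MB, an illustrative figure for the 319 MB Flame
bg_minfree_kb=10240        # background-app kill threshold reported in this bug
doubled_kb=$((bg_minfree_kb * 2))

if [ "$free_plus_cache_kb" -gt "$doubled_kb" ]; then
  echo "doubled minfree (${doubled_kb} KB) is still below free+cache (${free_plus_cache_kb} KB), so the LMK never fires"
fi
```

This is why simply scaling the existing parameters up looked unconvincing: the kill threshold stays below the memory level the device typically sits at.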
Flags: needinfo?(cyu)
I cross-checked the minfree parameters with the Intex CloudFX by running cat /sys/module/lowmemorykiller/parameters/minfree:

Flame: 1024,1280,1536,1792,2048,2560
Intex: 1024,1280,1536,1792,2048,4608

So the parameters on the Flame 319 MB are set stricter than on the Intex!! That must be totally wrong.
Also /sys/module/lowmemorykiller/parameters/notify_trigger:

Flame: 2304
Intex: 3584

Flame is still stricter than Intex.
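The minfree entries are page counts (one page = 4 KiB), so a quick conversion makes the absolute thresholds easier to compare. A sketch using the values quoted above (the helper function name is mine):

```shell
# Convert LMK minfree page counts to KiB (one page = 4 KiB). The last entry
# is the threshold for the most expendable (background) processes.
flame_minfree="1024,1280,1536,1792,2048,2560"
intex_minfree="1024,1280,1536,1792,2048,4608"

pages_to_kib() {
  echo "$1" | tr ',' '\n' | while read -r pages; do
    echo "$((pages * 4)) KiB"
  done
}

echo "Flame:"; pages_to_kib "$flame_minfree"
echo "Intex:"; pages_to_kib "$intex_minfree"
```

The last Flame entry works out to 10240 KiB (10 MiB) versus 18432 KiB (18 MiB) on the Intex, which matches the 10240 KB KillUnderKB value quoted elsewhere in this bug.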
Looping Paul and Walter in. I think this bug is related to the problem where the device becomes janky after MTBF tests.
minfree for background apps has always been 20480 KB:

http://mxr.mozilla.org/mozilla-central/source/b2g/app/b2g.js#740
http://mxr.mozilla.org/mozilla-b2g32_v2_0/source/b2g/app/b2g.js#718
http://mxr.mozilla.org/mozilla-beta/source/b2g/app/b2g.js#737
http://mxr.mozilla.org/mozilla-aurora/source/b2g/app/b2g.js#740

but /system/b2g/defaults/pref/lmk.js changed notify_trigger and minfree for background apps. We need to check why the base image contains such parameters.
NI'ing Wesly Huang for the issue in the base image LMK settings. Wesly, we'd like your help checking why /system/b2g/defaults/pref/lmk.js has such restrictive values. Thanks.
Flags: needinfo?(wehuang)
:cyu, would you move this bug to the appropriate component? Thanks.
Flags: needinfo?(cyu)
Component: Gaia::System::Window Mgmt → GonkIntegration
Flags: needinfo?(cyu)
Hi Youlong:

Please see the discussion above, then comment #20.

We are checking the MTBF issue and now realize the low memory killer setting in your image is quite strict (processes are killed too late, when free RAM is already very low). We would like to know the reason behind it, and it may need to change as well. Thank you.
Flags: needinfo?(wehuang) → needinfo?(youlong.jiang)
BTW, this does not seem to be present in SW earlier than v180. Is it possible that this change was made when you upgraded to the QCT CS?
Any update?
(In reply to Wesly Huang from comment #22)
> We are checking the MTBF issue and now realize the low memory killer setting
> in your image is quite strict (processes are killed too late, when free RAM
> is already very low). We would like to know the reason behind it, and it may
> need to change as well. Thank you.

Hi Wesly,

Per your previous summary, you suspect this issue is caused by the strict LMK parameters, and you checked /system/b2g/defaults/pref/lmk.js to confirm the current status. We haven't modified this, so could you help analyze it and provide the configuration interface and recommended values? We'll cooperate and release a test base image to you.

Thanks.
Flags: needinfo?(youlong.jiang)
Hi Youlong: Are you saying the values are all defaults from QCT? Please help list the values used in v123, v180, and v188 for our reference.

Hi Cervantes: Could you help suggest values? Could we reference some other products and select proper ones? Thank you.
Flags: needinfo?(youlong.jiang)
Flags: needinfo?(cyu)
(In reply to Wesly Huang from comment #26)
> Hi Cervantes: Could you help suggest values? Could we reference some other
> products and select proper ones? Thank you.

I think the default values in b2g.js should work on most devices:

pref("hal.processPriorityManager.gonk.BACKGROUND.KillUnderKB", 20480);
pref("hal.processPriorityManager.gonk.notifyLowMemUnderKB", 14336);

We may just remove lmk.js so that the default values take effect, and see if the issue remains. If it does, we might need other tweaks, such as increasing the values or changing swappiness.
Flags: needinfo?(cyu)
(In reply to Wesly Huang from comment #26)
> Hi Youlong: Are you saying the values are all defaults from QCT? Please help
> list the values used in v123, v180, and v188 for our reference.

We've checked v123, v180, and v188; they are all the same:

pref("hal.processPriorityManager.gonk.BACKGROUND.KillUnderKB", 10240);
pref("hal.processPriorityManager.gonk.notifyLowMemUnderKB", 9216);

Please take this as a reference.

Thanks.
Flags: needinfo?(youlong.jiang)
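A quick way to compare a device's lmk.js against the Gecko defaults without flashing anything is to pull the file and grep out the numbers. A minimal local sketch (the heredoc below merely recreates the values from this comment so the snippet is self-contained; on a real device you would first run `adb pull /system/b2g/defaults/pref/lmk.js`):

```shell
# Recreate the base image's lmk.js locally (values from this comment) and
# compare it against the b2g.js defaults quoted in comment 27.
cat > lmk.js <<'EOF'
pref("hal.processPriorityManager.gonk.BACKGROUND.KillUnderKB", 10240);
pref("hal.processPriorityManager.gonk.notifyLowMemUnderKB", 9216);
EOF

default_kill=20480     # b2g.js BACKGROUND.KillUnderKB
default_notify=14336   # b2g.js notifyLowMemUnderKB

base_kill=$(sed -n 's/.*KillUnderKB", \([0-9]*\).*/\1/p' lmk.js)
base_notify=$(sed -n 's/.*notifyLowMemUnderKB", \([0-9]*\).*/\1/p' lmk.js)

echo "KillUnderKB:         base image $base_kill KB vs default $default_kill KB"
echo "notifyLowMemUnderKB: base image $base_notify KB vs default $default_notify KB"
```

Both base-image values come out well below the Gecko defaults, which is the "stricter than intended" problem this bug is about.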
Per comment 18, this is affecting the results of MTBF testing. Setting this to block the MTBF-B2G meta bug.
Blocks: MTBF-B2G
Hi, Cervantes,

Do we even need different parameters for 319 MB? If so, can you help determine the parameters and communicate with the people who might know how to better evaluate what we set for 319 MB?
Flags: needinfo?(cyu)
As I said in comment #27, we don't need the values in lmk.js. They are too restrictive and are the root cause of this bug.
Flags: needinfo?(cyu)
I removed /system/b2g/defaults/pref/lmk.js and the problem no longer reproduces on my Flame, which is running the vanilla v188 image. So I suggest just removing lmk.js and using the default values in Gecko.
Sorry for catching up late. NI'ing T2M.

@Youlong: as just discussed on the phone, please help arrange a local build (userdebug) with comment #27's suggestion, then release it to me for more verification here. Thank you.
Flags: needinfo?(youlong.jiang)
Summary: [Performance][Dialer] Call UI can take 5 seconds to appear when multiple other apps are open. → [Performance][Dialer] lmk.js is wrong in Flame base image
(In reply to Wesly Huang from comment #33)
> @Youlong: as just discussed on the phone, please help arrange a local build
> (userdebug) with comment #27's suggestion, then release it to me for more
> verification here. Thank you.

Hi Wesly,

We found that lmk.js is generated per build, so could you please help point out the correct lmk.js values for your test?

Thanks.
Flags: needinfo?(youlong.jiang)
(In reply to youlong.jiang from comment #34)
> We found that lmk.js is generated per build, so could you please help point
> out the correct lmk.js values for your test?

Viral, any idea?
Flags: needinfo?(vwang)
Per comment 32, please just do as Cervantes suggests.
Flags: needinfo?(vwang)
Re-NI'ing the stakeholder.
Flags: needinfo?(vwang)
Actually, I think Cervantes already provided the answer in comment 27.
We should keep the default values.
Flags: needinfo?(vwang)
Hi Youlong, I assume your question is how to change the values as suggested in comment #27, right? If Cervantes and Viral have no suggestion here, I recommend going to QCT for the answer.


@Cervantes, Viral: do you know how?
Flags: needinfo?(youlong.jiang)
Flags: needinfo?(vwang)
Flags: needinfo?(cyu)
Hi Wesly,

It looks like the code comes from Qualcomm, in device/qcom/msm8610/msm8610.mk:
in "device/qcom/msm8610/msm8610.mk"
out/target/product/$(TARGET_PRODUCT)/system/gecko: gaia/profile/defaults/pref/lmk.js
.PHONY: gaia/profile/defaults/pref/lmk.js
gaia/profile/defaults/pref/lmk.js: gaia/profile.tar.gz
        echo 'pref("hal.processPriorityManager.gonk.BACKGROUND.KillUnderKB", 10240);' > $@
        echo 'pref("hal.processPriorityManager.gonk.notifyLowMemUnderKB", 9216);' >> $@

It overwrites our default settings.
I think we should ask Qualcomm the reason why they modified the low memory killer parameters.
Maybe they can remove it so we can use the default values as we expect.
Flags: needinfo?(wehuang)
Flags: needinfo?(vwang)
Flags: needinfo?(cyu)
Thanks for Viral's help!

@Youlong, please help check with QCT for the reason, see if it's OK to change, and how to change it. Thank you.
Flags: needinfo?(wehuang)
Hi Michael,

Not sure if you can help with the question in comment 40.
It looks like you overwrite the low memory killer parameters in device/qcom/msm8610/msm8610.mk.
We are suffering some OOM issues in this case.
Is it possible for you to remove it so we can use the default LMK settings?
Flags: needinfo?(mvines)
(In reply to viral [:viralwang] from comment #40)
> It overwrites our default settings.
> I think we should ask Qualcomm the reason why they modified the low memory
> killer parameters.
> Maybe they can remove it so we can use the default values as we expect.


We changed this to save background apps from getting killed by the LMK frequently. Reducing this parameter ensures that we can have more background apps; to do this, we rely more on zram on the 256 MB device.

Instead of changing this value, I would suggest understanding what is using the most CPU in the use case from comment 0: "Incoming Call UI takes up to 5 seconds to appear and allow the user to accept or deny the call."

Running |adb shell top -t -m 10| should tell us the CPU usage during this operation.
Flags: needinfo?(mvines)
           NAME   PID PPID CPU(s) NICE  USS  PSS  RSS SWAP VSIZE OOM_ADJ USER     
            b2g   206    1  637.9    0 52.1 53.4 60.3 19.6 251.5       0 root     
         (Nuwa)   400  206    5.7    0  0.0  0.2  1.8  6.8  53.7     -16 root     
OperatorVariant   935  400    7.7   18  0.3  0.7  5.1  9.1  61.8      10 u0_a935  
     Homescreen  1185  400   90.8    1  5.2  6.1 12.1 12.3  83.4       2 u0_a1185 
        Browser  3435  400   23.8   18  1.5  2.3  8.4 13.4  71.2      10 u0_a3435 
Smart Collectio  9856  400    5.8   18  2.8  3.4  8.7  8.7  64.1      10 u0_a9856 
       Messages 10881  206    3.9   18  4.4  5.0  9.9 11.1  75.3      10 u0_a10881
 Communications 10890  400    4.8   18  6.3  7.2 13.4 10.8  71.4      10 u0_a10890
       Settings 10962  400    5.3   18  4.7  5.6 11.6 10.5  68.0      10 u0_a10962
    Marketplace 11011  400    9.0   18  7.7  8.5 14.3 11.2  76.6      10 u0_a11011
(Preallocated a 11309  400    0.8   18  4.0  4.7  9.8  4.8  60.8       1 u0_a11309


System memory info:

            Total 215.3 MB
        SwapTotal 192.0 MB
     Used - cache 187.8 MB
  B2G procs (PSS)  97.2 MB
    Non-B2G procs  90.6 MB
     Free + cache  27.5 MB
             Free   6.6 MB
            Cache  20.9 MB
         SwapFree  80.9 MB
Low-memory killer parameters:

  notify_trigger 9216 KB

  oom_adj min_free
        0  4096 KB
       58  5120 KB
      117  6144 KB
      352  7168 KB
      470  8192 KB
      588 10240 KB

SwapFree 80.9 MB: this tells us that almost 192 - 80.9 = 111 MB of data has been pushed to the zram device. 

We had at least 7 background apps running when this issue happened, but we should also understand CPU usage before making the LMK more aggressive about killing background apps.
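The swap arithmetic above generalizes to any b2g-info dump. A tiny sketch using the figures reported in this comment (the variable names are mine):

```shell
# Estimate data resident in zram: SwapTotal - SwapFree (MB, from b2g-info).
swap_total_mb=192.0
swap_free_mb=80.9
awk -v t="$swap_total_mb" -v f="$swap_free_mb" \
    'BEGIN { printf "%.1f MB swapped into zram\n", t - f }'
```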
Running top on the device shows that the system has high system CPU usage (>50%) and relatively low user space CPU usage (~25%) during the 5 sec period from ringtone playing to call screen showing up. It's very likely that the system is working very hard swapping memory into/out of zram. There is no sign of the low memory killer taking action.

Actually from dmesg, I also see OOM killer taking action like:

<3>[30963.239308] Out of memory: Kill process 205 (b2g) score 575 or sacrifice child
<3>[30963.245539] Killed process 325 ((Nuwa)) total-vm:74100kB, anon-rss:108kB, file-rss:16kB
<4>[32064.980410] b2g-ps invoked oom-killer: gfp_mask=0xd0, order=2, oom_adj=0, oom_score_adj=0
<6>[32064.987646] [<c010bd74>] (unwind_backtrace+0x0/0xf8) from [<c087d9c8>] (dump_header.isra.10+0x74/0x180)

So I strongly suggest increasing the values as previously suggested.
Removing the NI since Cervantes already gave feedback.
Flags: needinfo?(vwang)
(In reply to Cervantes Yu from comment #45)
> Running top on the device shows that the system has high system CPU usage
> (>50%) and relatively low user space CPU usage (~25%) during the 5 sec
> period from ringtone playing to call screen showing up. It's very likely
> that the system is working very hard swapping memory into/out of zram. There
> is no sign of the low memory killer taking action.

Could you please also post the full output of |adb shell top -t -m 10|. I am curious to see what those processes are :) .
Flags: needinfo?(vwang)
Flags: needinfo?(cyu)
Flags: needinfo?(vwang)
Flags: needinfo?(cyu)
More experiments: if we don't ask the background process to GC, then the performance on an incoming call is much better, even with the current LMK settings and many apps open. The call screen shows up about 1 second after the ringtone starts playing.

My wild guess is that zram and GC'ing the background process just don't work well with each other. With the foreground and background apps running concurrently, zram could be repeatedly and alternately swapping pages in and out for these two processes.

This is not to say that we don't need to change the LMK settings. With the current LMK settings, I could even run out of swap space and have the OOM killer kick in and kill a random process. That is really bad for end users.

Gabriele, what's your comment on not GC'ing when sending a process to the background?
Flags: needinfo?(gsvelto)
(In reply to Cervantes Yu from comment #49)
> Gabriele, what's your comment on not GC when sending the process to
> background?

We already encountered this problem in the past (bug 963477) and disabled GC'ing the background application in the v1.3t branch only. The real solution would be to provide a fix for bug 1082290, which I'm working on, but that won't be ready for some time, and I see that this bug is 2.1+, so we need a quick fix here. Also, I'm not sure whether the kernel feature bug 1082290 will use is present in all the kernels we support; my guess is that it's not, so it wouldn't be enough.

What I would suggest is that we fix bug 975360 instead. The idea is that we would have a pref establishing whether applications sent to the background are GC'd. zram-based devices would then turn this pref off to prevent needless swapping from zram.

Bug 963477 delayed the GC, but I don't think it's a good idea in general because it's going to cause a slow-down anyway (just later), and it might have a negative impact on battery life due to the significant swapping.
Flags: needinfo?(gsvelto)
(In reply to Cervantes Yu from comment #48)
> Created attachment 8523640 [details]
> CPU usage when the bug is reproduced

I am seeing the line below in the CPU usage log, which suggests the b2g main process is doing some work:

14172  0  21% S    69 243864K  46960K     root     /system/b2g/b2g

For comment 50,

My vote is to fix the GC issues instead of changing the LMK. It seems like we are moving in the right direction already :)
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #51)
> My vote is to fix the GC issues instead of changing the LMK. It seems like
> we are moving in the right direction already :)

No, lmk.js also needs to be changed. Otherwise, if the OOM killer kicks in, it will kill a random process. The worst case is the b2g process, which is a system crash.
Flags: needinfo?(bbajaj)
Hi Gabriele and Tapas, 

What's your view on comment #52, i.e. fixing not only the GC but also the LMK? Per comment #27 and comment #28, the Flame indeed has smaller values than the working settings in the current b2g.js.
Flags: needinfo?(tkundu)
Flags: needinfo?(gsvelto)
(In reply to Wesly Huang from comment #53)
> What's your view on comment #52, i.e. fixing not only the GC but also the
> LMK? Per comment #27 and comment #28, the Flame indeed has smaller values
> than the working settings in the current b2g.js.

I agree. The values present in the base image don't make any sense. They will prevent the kill order we set up for applications from working correctly, since the KillUnderKB value for background applications is terribly close to all the others. There's also not enough room around the notifyLowMemUnderKB threshold, possibly making low-memory notifications useless.

Note that the default parameters were designed with a 256 MiB device in mind. On a device like the Flame, with more memory, they could be raised a bit to allow more wiggle room when lots of apps are open, but definitely not lowered.
Flags: needinfo?(gsvelto)
(In reply to Cervantes Yu from comment #52)
> No, lmk.js also needs to be changed. Otherwise, if the OOM killer kicks in,
> it will kill a random process. The worst case is the b2g process, which is a
> system crash.

Not really. The LMK will kill the b2g process only if the system is still under memory pressure even after it kills Nuwa, the homescreen, the preallocated app, and all other FFOS apps. We never saw b2g get killed randomly when there was no memory leak in the system.


IMO, we should modify lmk.js only after solving the GC issues. If we still see the problem then, we can go ahead and change the LMK settings.
Flags: needinfo?(tkundu) → needinfo?(wehuang)
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #55)
> Not really. The LMK will kill the b2g process only if the system is still
> under memory pressure even after it kills Nuwa, the homescreen, the
> preallocated app, and all other FFOS apps. We never saw b2g get killed
> randomly when there was no memory leak in the system.

Yes, the main process is in a class of its own and *all* other apps will be killed before it. The only scenario in which the main process can be killed is if it has exhausted all memory on its own and all other apps have already been killed.

> IMO, we should modify lmk.js only after solving the GC issues. If we still
> see the problem then, we can go ahead and change the LMK settings.

The LMK changes are also needed. The settings you're seeing here are breaking our OOM policy. This page contains more details on the process and describes why the values of those parameters were set that way in b2g.js

https://developer.mozilla.org/en-US/Firefox_OS/Platform/Out_of_memory_management_on_Firefox_OS
(In reply to Gabriele Svelto [:gsvelto] from comment #56)
> The LMK changes are also needed. The settings you're seeing here are
> breaking our OOM policy. This page contains more details on the process and
> describes why the values of those parameters were set that way in b2g.js
> 
> https://developer.mozilla.org/en-US/Firefox_OS/Platform/
> Out_of_memory_management_on_Firefox_OS

OK, please change the LMK settings if you feel it is needed :) Thanks for informing us.
Whiteboard: [2.1-exploratory-3] → [2.1-exploratory-3][mtbf]
According to comment 40, comment 56, and comment 57, we need to change the LMK settings, and we need T2M's help to do that. If T2M doesn't know how, they will need to talk with the vendor who provides code to T2M. Thanks.
Component: GonkIntegration → Vendcom
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #55)
> Not really. The LMK will kill the b2g process only if the system is still
> under memory pressure even after it kills Nuwa, the homescreen, the
> preallocated app, and all other FFOS apps. We never saw b2g get killed
> randomly when there was no memory leak in the system.

There are 2 different killers: the low memory killer (LMK) and the OOM killer. The settings in lmk.js affect the low memory killer, but I am talking about the OOM killer. The OOM killer kills a process based on its OOM score. The score is computed using various hints, one of which is how much memory the process consumes.

On a running Flame (319 MB), b2g's oom_score is even higher than the preallocated process's:

b2g              0 root      203   1     214608 58876 ffffffff b6ef4894 S /system/b2g/b2g
(Preallocated a  2 u0_a2069  2069  389   79988  17408 ffffffff b6ef4894 S /system/b2g/b2g

root@flame:/ # cat /proc/203/oom_score                                         
150
root@flame:/ # cat /proc/2069/oom_score                                        
113

That is, when the OOM killer kicks in, b2g will be killed before the preallocated process, even though we set the LMK to work the other way.

> IMO, we should modify lmk.js only after solving the GC issues. If we still
> see the problem then, we can go ahead and change the LMK settings.

For the above reason, we should modify lmk.js whether we solve the GC issue or not. Otherwise we risk getting the b2g process killed.
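The oom_score values above come straight from procfs, so the check is easy to repeat on any device (via `adb shell`) or Linux host. A Linux-only sketch, shown here against the current shell's own pid rather than the b2g pids from the bug:

```shell
# The kernel OOM killer ranks processes by the badness score exposed in
# /proc/<pid>/oom_score (higher = killed first). This is independent of the
# LMK minfree thresholds that lmk.js configures.
pid=$$
if [ -r "/proc/$pid/oom_score" ]; then
  score=$(cat "/proc/$pid/oom_score")
else
  score="unavailable"   # procfs absent (e.g. non-Linux host)
fi
echo "oom_score for pid $pid: $score"
```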
Hi Youlong, please see comment #58 and contact QC for help changing the LMK setting.

@Tapas: if you know how to do this, maybe you can kindly guide T2M here? Thanks.
Flags: needinfo?(wehuang) → needinfo?(tkundu)
It's just a matter of removing that file. We already have sane defaults for those settings in master gecko.
(In reply to Gabriele Svelto [:gsvelto] from comment #61)
> It's just a matter of removing that file. We already have sane defaults for
> those settings in master gecko.

Yes, agreed. Please let me know if that works for you/T2M :)
Flags: needinfo?(tkundu)
Flags: needinfo?(wehuang)
Flags: needinfo?(bbajaj)
Thanks Tapas and Gabriele.

@Youlong: please follow the suggestion above and let us know if there are any questions. Thanks.
Flags: needinfo?(wehuang)
(In reply to Wesly Huang from comment #63)
> @Youlong: please follow the suggestion above and let us know if there are
> any questions. Thanks.

Hi Wesly,

We've removed lmk.js from the system image and will release a version with the patch applied.

Thanks.
Flags: needinfo?(youlong.jiang)
Verified with the latest base image v18D: lmk.js has already been removed.
Whiteboard: [2.1-exploratory-3][mtbf] → [2.1-exploratory-3][mtbf][POVB]
Hi Tapas:

Now the change is applied in T2M's SW release to us; however, our own full build still links to the code in CAF. Do you think you can make the same change there?
Flags: needinfo?(tkundu)
Please just fork the CAF project if you'd like to customize its contents for your Flame build.
Flags: needinfo?(tkundu)
Hi Wesly, any further action needed for this issue?
Flags: needinfo?(wehuang)
Any updates?

Also, we can fork the CAF project, but the settings on T2M phones will still be incorrect.
(In reply to Steven Yang [:styang] from comment #68)
> Hi Wesly, any further action needed for this issue?

In comment #65 it's verified that v18D has done the removal suggested in comment #61 and #62, so I see no further action needed for the T2M/Flame base image. (Also, we have no further Flame base image release planned after v18D.)

My understanding is that the remaining item is whether we want to make the same change in Mozilla's own Flame build; comments #66 and #67 cover this, so it depends on whether we want to fork it in our code.
Flags: needinfo?(wehuang)
Does this bug still block 2.2 MTBF?
Flags: needinfo?(wachen)
I believe that the setting is still wrong in new builds/images.
Flags: needinfo?(wachen)
See Also: → 1155854
MTBF Triage: remove from MTBF monitor.
No longer blocks: MTBF-B2G
Whiteboard: [2.1-exploratory-3][mtbf][POVB] → [2.1-exploratory-3][POVB]
blocking-b2g: 2.1+ → 2.5+
QA Whiteboard: [QAnalyst-Triage+] → [QAnalyst-Triage+][qa-tracking]
Per comment 32 and comment 65, lmk.js has been removed from the v18D image. This should already be resolved, and should not be a 2.5 blocker.

Marking verifyme to double-check.

-----
Build ID               20150804150207
Gaia Revision          c5425d9f1f5184731a59ed4bc99295acbde30390
Gaia Date              2015-08-04 16:09:19
Gecko Revision         https://hg.mozilla.org/mozilla-central/rev/f3b757156f69
Gecko Version          42.0a1
Device Name            flame
Firmware(Release)      4.4.2
Firmware(Incremental)  eng.cltbld.20150712.193621
Firmware Date          Sun Jul 12 19:36:34 EDT 2015
Bootloader             L1TC000118D0
Please help verify this bug. Thanks.
Keywords: qawanted
This issue still reproduces on Flame 2.2 and 2.1. Following the STR, the device takes more than 5 seconds to show the call UI. Reproduction frequency is 3 out of 3 on each branch.

Device: Flame 2.2 (full flashed 319MB KK)
BuildID: 20150828032506
Gaia: 335cd8e79c20f8d8e93a6efc9b97cc0ec17b5a46
Gecko: 16d864d163de
Gonk: bd9cb3af2a0354577a6903917bc826489050b40d
Version: 37.0 (2.2) 
Firmware Version: v18Dv4
User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0

Device: Flame 2.1 (full flashed 319MB KK)
BuildID: 20150724001207 (note: we stopped getting newer builds on this branch)
Gaia: 9dba58d18006e921546cec62c76074ce81e16518
Gecko: 41e10c6740be
Gonk: bd9cb3af2a0354577a6903917bc826489050b40d
Version: 34.0 (2.1) 
Firmware Version: v18Dv4
User Agent: Mozilla/5.0 (Mobile; rv:34.0) Gecko/34.0 Firefox/34.0

-------

This issue does NOT occur on Flame 2.5/master. Following the STR, the call UI is displayed within 2 seconds.

I think the reason this doesn't repro on master is bug 1172167, where apps are aggressively killed in the background. Without that change, master is likely still affected.

Device: Flame 2.5 (full flashed 319MB KK)
BuildID: 20150828030207
Gaia: b69c16798ddd7154207f56d983721a327522f5d1
Gecko: 87e23922be375985d0b1906ed5ba5f095f323a38
Gonk: c4779d6da0f85894b1f78f0351b43f2949e8decd
Version: 43.0a1 (2.5 Master) 
Firmware Version: v18Dv4
User Agent: Mozilla/5.0 (Mobile; rv:43.0) Gecko/43.0 Firefox/43.0
QA Whiteboard: [QAnalyst-Triage+][qa-tracking] → [QAnalyst-Triage?][qa-tracking][failed-verification]
Flags: needinfo?(jmercado)
Keywords: qawanted, verifyme
Bobby, please see comment 76.
QA Whiteboard: [QAnalyst-Triage?][qa-tracking][failed-verification] → [QAnalyst-Triage+][qa-tracking][failed-verification]
Flags: needinfo?(jmercado) → needinfo?(bchien)
Per comment 76, could you share a b2g-info log per your STR, so that we can work with Cervantes on further troubleshooting?
Flags: needinfo?(pcheng)
Flags: needinfo?(cyu)
Flags: needinfo?(bchien)
Attached file bug1081577_b2g-info
Attaching b2g-info output after the bug reproduced on a Flame 2.2 with 319MB memory.
Flags: needinfo?(pcheng)
  oom_adj min_free
        0  4096 KB
       58  5120 KB
      117  6144 KB
      352  7168 KB
      470  8192 KB
      588 10240 KB
      ^^^^^^^^^^^^
This explains why the problem remains. But it's verified that /system/b2g/defaults/pref/lmk.js is removed, isn't it? We need to find out where the value 10240 KB comes from.
Flags: needinfo?(cyu)
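To illustrate how the two columns above pair up: the in-kernel low memory killer starts killing processes at a given oom_adj once free memory drops below the min_free value on the same row. A hedged shell sketch using the values copied from the table above (on a device these tables live in /sys/module/lowmemorykiller/parameters/adj and .../minfree, the latter in 4 KB pages; the helper function here is hypothetical):

```shell
# Sketch of the adj/min_free pairing shown above (values copied from the
# table; an illustration, not the device's actual configuration source).
adjs="0 58 117 352 470 588"
minfrees_kb="4096 5120 6144 7168 8192 10240"

# For a process with the given oom_adj, print the free-memory level (KB)
# below which the low memory killer starts killing it: the min_free paired
# with the largest table adj that is <= the process's adj.
threshold_for_adj() {
  target=$1
  result=""
  i=1
  for adj in $adjs; do
    if [ "$target" -ge "$adj" ]; then
      result=$(echo "$minfrees_kb" | cut -d' ' -f"$i")
    fi
    i=$((i + 1))
  done
  echo "$result"
}

threshold_for_adj 588   # adj-588 apps are killed once free memory < 10240 KB
threshold_for_adj 117   # adj-117 processes survive until free memory < 6144 KB
```

The highlighted 10240 KB row is the value Cervantes is pointing at; the sketch only shows how a process's oom_adj maps onto the min_free thresholds, not where that 10240 KB value is being set.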
Marking as P2 for 2.5
Priority: -- → P2
Per the results in comment 74 and comment 76, this issue has been fixed in v2.5/v18D. Removing the v2.5 blocker.

However, as noted in comment 74, lmk.js has already been removed from the v18D image, so there is no default lmk.js configuration on builds based on v18D. Presumably there is another configuration in v2.2 that causes the results in the Description and comment 80.

Mahe and Josh, should we continue to investigate and fix this issue in v2.1 and v2.2?
blocking-b2g: 2.5+ → ---
Flags: needinfo?(mpotharaju)
Flags: needinfo?(jcheng)
Hi Bobby,
I would prefer to consider this for 2.2r, as it is the branch we have devices shipping on.
However, I am not sure the fix from bug 1172167 also applies to 2.2r, as it has a different UI?
Flags: needinfo?(jcheng) → needinfo?(martijn.martijn)
(In reply to Josh Cheng [:josh] from comment #83)
> Hi Bobby,
> I would prefer consider this for 2.2r as it is the one we have device
> shipping. 
> However I am not sure the fix from bug 1172167 also apply to 2.2r as it is
> different UI?

Josh, I'm not sure what you're asking. I only disabled a test in bug 1172167, because that test was failing due to the aggressive LMK (although it could also be fixed in a different way).
Flags: needinfo?(martijn.martijn)
Martijn, 

Based on comment 76, bug 1172167 was cited as the reason this issue is not being reported on master. Can we get confirmation that this is the bug making the issue not reproducible on master?
I think Josh meant to mention a different bug in comment 83, right Josh?
Flags: needinfo?(jocheng)
(In reply to Martijn Wargers [:mwargers] (QA) from comment #86)
> I think Josh meant to mention a different bug in comment 83, right Josh?

The bug I mentioned is based on comment 76 from Pei-Wei: "doesn't repro on master is because of bug 1172167 where apps are getting aggressively killed in the background."
Did Pei-Wei mention the wrong bug?
Flags: needinfo?(jocheng)
(In reply to Josh Cheng [:josh] from comment #87)
> The bug I mentioned is base on comment 76 from Pei-Wei: "doesn't repro on
> master is because of bug 1172167 where apps are getting aggressively killed
> in the background."
> Did Pei-Wei mention wrong bug?

I guess not, but in comment 83, you mentioned:

(In reply to Josh Cheng [:josh] from comment #83)
> However I am not sure the fix from bug 1172167 also apply to 2.2r as it is
> different UI?

I don't see any fix in bug 1172167. There is only a pull request there that disabled one of the Gaia UI tests.

(In reply to Mahendra Potharaju [:mahe] from comment #85)
> Martijn, 
> 
> Based on comment 76, 1172167 was referred as the reason this issue is not
> being reported on Master. Can we get a confirmation if this is the bug that
> is making issue not reproducible on Master?

It's not clear what caused bug 1172167. Perhaps it was caused by the pull request from bug 1094759; that is certainly not something that's easily backported.
I can certainly understand that eagerly killing apps would make this bug no longer appear.
Bobby, yes, we need to continue investigating this. This issue is a blocker if it surfaces on 2.5 OR 2.2. We are limiting patches on 2.2 as Qualcomm has completed their testing. Unless the cause is identified and fixed, we cannot confirm it won't resurface on master.
Flags: needinfo?(mpotharaju) → needinfo?(bchien)
See comment 3 for why we can't find a regression window for this bug. This is a vendor issue.
Flags: needinfo?(jmercado)
Sorry, I missed that. Thanks for pointing that out, Pei-Wei. But we would still need to continue investigating this on 2.2.
Flags: needinfo?(jmercado)
Marking 2.2+ for tracking.
blocking-b2g: --- → 2.2+
See Also: → 1124093
Marking as resolved fixed in v2.5 and later. Leaving 2.2 as wontfix.
Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(bchien)
Resolution: --- → FIXED