Closed Bug 887192 Opened 7 years ago Closed 7 years ago

GonkHal should write lowmemorykiller parameters in increasing order

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-b2g:leo+, firefox23 wontfix, firefox24 wontfix, firefox25 fixed, b2g18? fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 wontfix, b2g-v1.1hd fixed)

VERIFIED FIXED
1.1 QE4 (15jul)
blocking-b2g leo+
Tracking Status
firefox23 --- wontfix
firefox24 --- wontfix
firefox25 --- fixed
b2g18 ? fixed
b2g18-v1.0.0 --- wontfix
b2g18-v1.0.1 --- wontfix
b2g-v1.1hd --- fixed

People

(Reporter: alan.yenlin.huang, Assigned: alan.yenlin.huang)

References

Details

(Keywords: perf, Whiteboard: [c= ][LeoVB+])

Attachments

(1 file, 3 obsolete files)

I found we now have decreasing order of lowmemorykiller parameters from "/sys/module/lowmemorykiller/parameters/minfree" and "/sys/module/lowmemorykiller/parameters/adj". 

$ adb shell cat /sys/module/lowmemorykiller/parameters/adj
6,4,3,2,1,0
$ adb shell cat /sys/module/lowmemorykiller/parameters/minfree
5120,2048,1792,1536,1280,1024

In lowmemorykiller.c, it assume that these parameters are in increasing order. It calculates # of free pages, and begins to compare the # of frees pages with the first minfree parameters and so on. This loop breaks when # of free pages is less than the minfree parameter it has in hand. 

in lowmem_shrink of lowmemorykiller.c:

>        for (i = 0; i < array_size; i++) {
>                if (other_free < lowmem_minfree[i] &&
>                    other_file < lowmem_minfree[i]) {
>                        min_adj = lowmem_adj[i];
>                        break;
>                }
>        }

So if we set lowmemorykiller parameters in decreasing order, we will have unexpected result of lowmem_shrink in lowmemorykiller.c. 

I think this bug come from Bug 870181 patch part 1, and the fix should be easy: just invert the loop through NUM_PROCESS_PRIORITY in EnsureKernelLowMemKillerParamsSet() form GonkHal.cpp.
blocking-b2g: --- → tef?
Hello Justin,
I just invert the loop which go through all process priorities in this patch. Please help to review this, thanks!
Attachment #767802 - Flags: review?(justin.lebar+bug)
Comment on attachment 767802 [details] [diff] [review]
Invert the loop through NUM_PROCESS_PRIORITY

Ouch, thanks for finding this bug, Alan.

Can you please add an assertion in the code that we set these params in the right order, and add a comment briefly explaining why this is important?
Attachment #767802 - Flags: review?(justin.lebar+bug) → feedback+
As I read the kernel code, the effect of this bug is that when there's less than 5120*4kb = 20mb free, we'll always kill some process, even if there are no bg processes present.  After this bug is fixed, we'll kill a fg process only when there's less than 1280 * 4kb = 5mb free.

So this patch lets fg apps use 15mb more memory, which is great.

However, with or without this patch, we still kill processes in strict order of priority.

Is that right?

I'm not totally sure that we want to take this for 1.1, because the effect of letting processes use more memory is that we'll have less memory available for the file cache, which has an impact on performance.  OTOH, this effectively expands the amount of memory on our phone significantly, which should help us avoid some OOM crashes, and I know partners care dearly about those.
blocking-b2g: tef? → leo?
This might explain some of the behavior we were seeing when receiving phone calls & memory pressure events at the same time, I definitely need to re-test with this patch applied.

BTW We're in Taipei with Leo & Tef teams so I can raise this issue during today's meetings and see if they're interested in having the patch uplifted.
> BTW We're in Taipei with Leo & Tef teams so I can raise this issue during today's meetings and see 
> if they're interested in having the patch uplifted.

Great, let me know if there's anything I can do to help.
We'd like to get the leo+ status for this bug because it can bring significant benefits:

- As Justin pointed out in comment 3 it will enable the fg application to get 15 more megs of free memory which is significant on 256MiB devices

- It will make the order in which applications are killed due to OOM more predictable. This is important also for devices with lots of memory; with many applications open the user might find that certain important applications are killed unexpectedly even though they should keep running (such as the music player, see bug 824000)

If we can get the patch through our regular tests the risk factor for taking this patch should be fairly low. The reason is that we already exercise this functionality extensively and we would only be changing its configuration, not the logic behind it. Additionally the affected code is nearly identical between b2g18 and central so we don't need to land different patches. If we can prove in a test that applications will get killed in the proper order then I think we should be pretty safe.
Blocks: 824000
Patch v2, I also add assertion to make sure we won't break this rule in future.
Attachment #767802 - Attachment is obsolete: true
Attachment #768148 - Flags: review?(justin.lebar+bug)
> - It will make the order in which applications are killed due to OOM more predictable.

I was trying to say in comment 3 that I don't think this is the case.  I'm looking again and I /think/ I was right, but I'm not sure.

I'm pretty tired; can you look at the kernel source and let me know what you think?
What's weird is, it looks like we'll only kill processes with oom_adj 6 using the lmk, since selected_oom_adj will always be 6.

But then why do we ever see the LMK killing fg apps?  I've also seen the OOM killer get involved, and that makes sense.  But it's not the only thing that kills fg apps.  So I feel like I'm missing something here.

This might explain why I've seen the homescreen app (oom_adj 4) outlive a fg app (oom_adj 2).  That's certainly bad, and this patch will probably fix that.  I'm just not expecting miracles here.
Comment on attachment 768148 [details] [diff] [review]
Write LMK parameters in increasing order, and add assertion when someone breaks the rule

>@@ -1074,21 +1077,33 @@ EnsureKernelLowMemKillerParamsSet()
>     int32_t killUnderMB;
>     if (!NS_SUCCEEDED(Preferences::GetInt(
>           nsPrintfCString("hal.processPriorityManager.gonk.%s.KillUnderMB",
>                           ProcessPriorityToString(priority)).get(),
>           &killUnderMB))) {
>       MOZ_CRASH();
>     }
> 
>+    // LMK in kernel calculates # of free pages, and begins to compare the # 
>+    // of frees pages with the first minfree parameters and so on. The loop 
>+    // in LMK breaks when # of free pages is less than the minfree parameter 
>+    // it has in hand. So, we have to make sure our parameters are in 
>+    // increasing order

It's probably better to say "number" instead of "#".

Reading this, I don't think we can describe in a comment what's going on well
enough for someone who isn't familiar with this bug to understand this.  We
should just say something like

  The LMK silently malfunctions if we assign the parameters in non-decreasing
  order; see bug 887192.

>+    MOZ_ASSERT(oomScoreAdj > lowerBoundOfNextOomScoreAdj && 

Nit: Trailing space.

>+               killUnderMB > lowerBoundOfNextKillUnderMB);

Also, please split this assertion in two.

>     // adj is in oom_adj units.
>     adjParams.AppendPrintf("%d,", OomAdjOfOomScoreAdj(oomScoreAdj));
> 
>     // minfree is in pages.
>     minfreeParams.AppendPrintf("%d,", killUnderMB * 1024 * 1024 / PAGE_SIZE);
>+
>+    // Update the lower buonds of next oomScoreAdj and killUnderMB parameters,

s/,/./
s/buonds/bounds/

Actually, I don't think you need this comment at all.  :)

>+    lowerBoundOfNextOomScoreAdj = oomScoreAdj;
>+    lowerBoundOfNextKillUnderMB = killUnderMB;
>   }
Attachment #768148 - Flags: review?(justin.lebar+bug) → review+
(In reply to Justin Lebar [:jlebar] from comment #8)
> I'm pretty tired; can you look at the kernel source and let me know what you
> think?

Sure, I'll have a look ASAP.
Revise comments and split assertion statement into two.

I tested this on my unagi, both debug and non-debug build. I also pushed this to try server and waiting for result. See https://tbpl.mozilla.org/?tree=Try&rev=bf7acaa7ad89 .
Attachment #768148 - Attachment is obsolete: true
Attachment #768281 - Flags: review?(justin.lebar+bug)
Someone on #developers noticed that the try push for this patch used:
"try: -b do -p all -u all -t none"

This causes 160-180 hours of machine time to be used - when the patch was B2G only.

Please can you choose more specific platforms using http://trychooser.pub.build.mozilla.org/ when you don't need to test everywhere, to save using resources unnecessarily (which in the case of the EC2 run jobs, directly costs $$$ per push).

Thanks! :-)
(In reply to Ed Morley [:edmorley UTC+1] from comment #13)
> Someone on #developers noticed that the try push for this patch used:
> "try: -b do -p all -u all -t none"
> 
> This causes 160-180 hours of machine time to be used - when the patch was
> B2G only.
I got it, thanks for reminding. 
I didn't notice I am just doing this in GonkHal. I stopped the try push, thanks!
> Attachment #768281 [details] [diff] - Flags: review?(justin.lebar+bug@gmail.com)

I gave you r+ on the last patch, so you don't need to ask for another review.

If on the other hand there's something here that you'd like me to look at, I'm happy to; just let me know what it is.
Attachment #768281 - Flags: review?(justin.lebar+bug)
(In reply to Gabriele Svelto [:gsvelto] from comment #6)
> We'd like to get the leo+ status for this bug because it can bring
> significant benefits:
> 
> ...
> 
> If we can get the patch through our regular tests the risk factor for taking
> this patch should be fairly low...

From Triage:

Since the TBPL tests didn't complete we don't have the low risk confirmation that Gabriele mentioned in comment #6 so the risk is too high for leo+. Nominating for koi and tracking.
Status: NEW → ASSIGNED
blocking-b2g: leo? → koi?
tracking-b2g18: --- → ?
Keywords: perf
Whiteboard: [c= ]
Maybe we misunderstood each other.

Whether or not this patch passes tests by the time triage happens to look at the patch has nothing to do with how risky the patch is.  That has to do with how loaded our infrastructure is.  We're not making blocking decisions based on infra load, right?

I'm pretty sure Gabriele was saying in comment 6 that he thinks this patch will be safe, in the sense that we'd notice any regressions as part of our automated test suite.  I don't think he was saying that you should not approve for blocking if triage happened to occur before the automated tests completed.
blocking-b2g: koi? → leo?
Allan, can you please push this to try again (for the relevant platforms) so release drivers can make an informed decision here?
Next time you would like completed test runs before making a triage decision, I would really appreciate it if you'd consider asking for those and waiting, instead of proactively marking the bug as blocking-minus.  That would make us in engineering feel more like partners in this process.
(In reply to Justin Lebar [:jlebar] from comment #19)
> Next time you would like completed test runs before making a triage
> decision, I would really appreciate it if you'd consider asking for those
> and waiting, instead of proactively marking the bug as blocking-minus.  That
> would make us in engineering feel more like partners in this process.

Thanks, Justin. Next time I will mark triage decision after test runs completely and successfully. 

I pushed this to try again, and I am waiting for result. I will mark "leo?" again once I got possitive result from tests.

https://tbpl.mozilla.org/?tree=Try&rev=ec9b9b6db603
blocking-b2g: leo? → ---
> I pushed this to try again, and I am waiting for result. I will mark "leo?"
> again once I got possitive result from tests.
> 
> https://tbpl.mozilla.org/?tree=Try&rev=ec9b9b6db603

Try server tested this completely and we got positive result of this patch.
blocking-b2g: --- → leo?
Keywords: checkin-needed
Hi Alan & Justin,
Since we have a crash on modem, Bug 886217, we have tested a lot using marionette scripts. During the test, what we found is that when the system enters to OOM state because of memory leakage in b2g process, it starts to kill the OOP applications one by one. After this happens a number of times finally it kills Homescreen App and then tries to reopen Homescreen App again but it couldn't open it properly. And kill & reopen it again & again... 
I think we need to take this patch for 1.1 to fix this unexpected behavior of LMK.
I will change the flag to leo+. Please consider this.
blocking-b2g: leo? → leo+
(In reply to Jinho Hwang [:jeffhwang] from comment #24)
> I think we need to take this patch for 1.1 to fix this unexpected behavior
> of LMK.
> I will change the flag to leo+. Please consider this.

I wouldn't expect this patch to fix the problem you describe.

If the phone is so low on memory that the homescreen immediately dies, then that's probably because the main process is using a lot of memory.  That's a problem, but it does not have anything to do with this fix.
m-c may be closed for awhile, so I've gone ahead and pushed to b2g18 based on being green on birch.

https://hg.mozilla.org/releases/mozilla-b2g18/rev/561f78c43eb7
Target Milestone: --- → 1.1 QE4 (15jul)
https://hg.mozilla.org/mozilla-central/rev/a5c067c78ffc
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Hi, Alan and all,

Thanks for your effort on this bug.
The patch works as we expected behavior.
The following is the detail information that QA collected.
-----------------------------------------------------------------
* Defect/Bug information:
  While user launched over than 4 apps on Fx OS, the  apps will only exist 2~3 apps.
  The others apps are "randomly" killed by memory management process of kernel.
  Please refer to the following video.
  -> https://dc1.safesync.com/LMmDYKWv/VIDEO0008.mp4?a=kwBNaKoMF6Q

* Root cause:
  In "lowmemorykiller.c", the wrong order of minfree caused this problem.

* Reproduce steps:
  1. Launching 5 apps on Fx OS. (Ex: dialler -> Message -> Contact -> Browser -> FM radio -> Settings)
  2. Switch to card view
  @ Expected result: The card view shows apps preview page by "sequence"
      (Settings -> FM radio -> Browser -> Contact -> Message -> dialler)
      Maybe "Message" app and dialler app are killed by system by default behavior.
  @ Actual result: The launched apps are randomly killed by system and only existing 2~3 apps on card view.

* Verified build:
   + Mozilla-b2g18-unagi-eng/2013-07-03-07-02-10
      - HG - Gecko : "a078412b1671"
      - HG - Gaia : "NA"
      - Git - Gecko: "38e6b257a832fbd6c91fac54bf8a8c5d3572e3fb"
      - Git - Gaia: "b12b6f8895427708fad4e25af180266085bd2e74"
   + Mozilla-b2g18_v1_1_0_hd-unagi/2013-07-03-07-02-14
      - HG - Gecko : "6f874c20e68e"
      - HG - Gaia : "NA"
      - Git - Gecko: "NA"
      - Git - Gaia: "b12b6f8895427708fad4e25af180266085bd2e74"

* Verified methods:
  1. Command line:
      william@tpemozilla:~/Mozilla-b2g18-unagi-eng/2013-07-03-07-02-10/$ adb shell cat /sys/module/lowmemorykiller/parameters/minfree
      1024,1280,1536,1792,2048,5120 ->>(Pass, it shows in increasing order)
      william@tpemozilla:~/Mozilla-b2g18-unagi-eng/2013-07-03-07-02-10/$ adb shell cat /sys/module/lowmemorykiller/parameters/adj
       0,1,2,3,4,6 (Pass, it shows in increasing order)
  2. Follow the reproduction steps I mentioned. You could click the following link to know the detail.
      -> https://dc1.safesync.com/LMmDYKWv/VIDEO0009.mp4?a=fE0C7WFP6K0
Status: RESOLVED → VERIFIED
Whiteboard: [c= ] → [c= ][LeoVB+]
You need to log in before you can comment on or make changes to this bug.