Closed
Bug 887192
Opened 11 years ago
Closed 11 years ago
GonkHal should write lowmemorykiller parameters in increasing order
Categories
(Firefox OS Graveyard :: General, defect)
Tracking
(blocking-b2g:leo+, firefox23 wontfix, firefox24 wontfix, firefox25 fixed, b2g18? fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 wontfix, b2g-v1.1hd fixed)
People
(Reporter: alan.yenlin.huang, Assigned: alan.yenlin.huang)
References
Details
(Keywords: perf, Whiteboard: [c= ][LeoVB+])
Attachments
(1 file, 3 obsolete files)
I found we now have decreasing order of lowmemorykiller parameters from "/sys/module/lowmemorykiller/parameters/minfree" and "/sys/module/lowmemorykiller/parameters/adj".
$ adb shell cat /sys/module/lowmemorykiller/parameters/adj
6,4,3,2,1,0
$ adb shell cat /sys/module/lowmemorykiller/parameters/minfree
5120,2048,1792,1536,1280,1024
In lowmemorykiller.c, it assume that these parameters are in increasing order. It calculates # of free pages, and begins to compare the # of frees pages with the first minfree parameters and so on. This loop breaks when # of free pages is less than the minfree parameter it has in hand.
in lowmem_shrink of lowmemorykiller.c:
> for (i = 0; i < array_size; i++) {
> if (other_free < lowmem_minfree[i] &&
> other_file < lowmem_minfree[i]) {
> min_adj = lowmem_adj[i];
> break;
> }
> }
So if we set lowmemorykiller parameters in decreasing order, we will have unexpected result of lowmem_shrink in lowmemorykiller.c.
I think this bug come from Bug 870181 patch part 1, and the fix should be easy: just invert the loop through NUM_PROCESS_PRIORITY in EnsureKernelLowMemKillerParamsSet() form GonkHal.cpp.
Assignee | ||
Updated•11 years ago
|
blocking-b2g: --- → tef?
Assignee | ||
Comment 1•11 years ago
|
||
Hello Justin,
I just invert the loop which go through all process priorities in this patch. Please help to review this, thanks!
Attachment #767802 -
Flags: review?(justin.lebar+bug)
Comment 2•11 years ago
|
||
Comment on attachment 767802 [details] [diff] [review]
Invert the loop through NUM_PROCESS_PRIORITY
Ouch, thanks for finding this bug, Alan.
Can you please add an assertion in the code that we set these params in the right order, and add a comment briefly explaining why this is important?
Attachment #767802 -
Flags: review?(justin.lebar+bug) → feedback+
Comment 3•11 years ago
|
||
As I read the kernel code, the effect of this bug is that when there's less than 5120*4kb = 20mb free, we'll always kill some process, even if there are no bg processes present. After this bug is fixed, we'll kill a fg process only when there's less than 1280 * 4kb = 5mb free.
So this patch lets fg apps use 15mb more memory, which is great.
However, with or without this patch, we still kill processes in strict order of priority.
Is that right?
I'm not totally sure that we want to take this for 1.1, because the effect of letting processes use more memory is that we'll have less memory available for the file cache, which has an impact on performance. OTOH, this effectively expands the amount of memory on our phone significantly, which should help us avoid some OOM crashes, and I know partners care dearly about those.
blocking-b2g: tef? → leo?
Comment 4•11 years ago
|
||
This might explain some of the behavior we were seeing when receiving phone calls & memory pressure events at the same time, I definitely need to re-test with this patch applied.
BTW We're in Taipei with Leo & Tef teams so I can raise this issue during today's meetings and see if they're interested in having the patch uplifted.
Comment 5•11 years ago
|
||
> BTW We're in Taipei with Leo & Tef teams so I can raise this issue during today's meetings and see
> if they're interested in having the patch uplifted.
Great, let me know if there's anything I can do to help.
Comment 6•11 years ago
|
||
We'd like to get the leo+ status for this bug because it can bring significant benefits:
- As Justin pointed out in comment 3 it will enable the fg application to get 15 more megs of free memory which is significant on 256MiB devices
- It will make the order in which applications are killed due to OOM more predictable. This is important also for devices with lots of memory; with many applications open the user might find that certain important applications are killed unexpectedly even though they should keep running (such as the music player, see bug 824000)
If we can get the patch through our regular tests the risk factor for taking this patch should be fairly low. The reason is that we already exercise this functionality extensively and we would only be changing its configuration, not the logic behind it. Additionally the affected code is nearly identical between b2g18 and central so we don't need to land different patches. If we can prove in a test that applications will get killed in the proper order then I think we should be pretty safe.
Blocks: 824000
Assignee | ||
Comment 7•11 years ago
|
||
Patch v2, I also add assertion to make sure we won't break this rule in future.
Attachment #767802 -
Attachment is obsolete: true
Attachment #768148 -
Flags: review?(justin.lebar+bug)
Comment 8•11 years ago
|
||
> - It will make the order in which applications are killed due to OOM more predictable.
I was trying to say in comment 3 that I don't think this is the case. I'm looking again and I /think/ I was right, but I'm not sure.
I'm pretty tired; can you look at the kernel source and let me know what you think?
Comment 9•11 years ago
|
||
What's weird is, it looks like we'll only kill processes with oom_adj 6 using the lmk, since selected_oom_adj will always be 6.
But then why do we ever see the LMK killing fg apps? I've also seen the OOM killer get involved, and that makes sense. But it's not the only thing that kills fg apps. So I feel like I'm missing something here.
This might explain why I've seen the homescreen app (oom_adj 4) outlive a fg app (oom_adj 2). That's certainly bad, and this patch will probably fix that. I'm just not expecting miracles here.
Comment 10•11 years ago
|
||
Comment on attachment 768148 [details] [diff] [review]
Write LMK parameters in increasing order, and add assertion when someone breaks the rule
>@@ -1074,21 +1077,33 @@ EnsureKernelLowMemKillerParamsSet()
> int32_t killUnderMB;
> if (!NS_SUCCEEDED(Preferences::GetInt(
> nsPrintfCString("hal.processPriorityManager.gonk.%s.KillUnderMB",
> ProcessPriorityToString(priority)).get(),
> &killUnderMB))) {
> MOZ_CRASH();
> }
>
>+ // LMK in kernel calculates # of free pages, and begins to compare the #
>+ // of frees pages with the first minfree parameters and so on. The loop
>+ // in LMK breaks when # of free pages is less than the minfree parameter
>+ // it has in hand. So, we have to make sure our parameters are in
>+ // increasing order
It's probably better to say "number" instead of "#".
Reading this, I don't think we can describe in a comment what's going on well
enough for someone who isn't familiar with this bug to understand this. We
should just say something like
The LMK silently malfunctions if we assign the parameters in non-decreasing
order; see bug 887192.
>+ MOZ_ASSERT(oomScoreAdj > lowerBoundOfNextOomScoreAdj &&
Nit: Trailing space.
>+ killUnderMB > lowerBoundOfNextKillUnderMB);
Also, please split this assertion in two.
> // adj is in oom_adj units.
> adjParams.AppendPrintf("%d,", OomAdjOfOomScoreAdj(oomScoreAdj));
>
> // minfree is in pages.
> minfreeParams.AppendPrintf("%d,", killUnderMB * 1024 * 1024 / PAGE_SIZE);
>+
>+ // Update the lower buonds of next oomScoreAdj and killUnderMB parameters,
s/,/./
s/buonds/bounds/
Actually, I don't think you need this comment at all. :)
>+ lowerBoundOfNextOomScoreAdj = oomScoreAdj;
>+ lowerBoundOfNextKillUnderMB = killUnderMB;
> }
Attachment #768148 -
Flags: review?(justin.lebar+bug) → review+
Comment 11•11 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #8)
> I'm pretty tired; can you look at the kernel source and let me know what you
> think?
Sure, I'll have a look ASAP.
Assignee | ||
Comment 12•11 years ago
|
||
Revise comments and split assertion statement into two.
I tested this on my unagi, both debug and non-debug build. I also pushed this to try server and waiting for result. See https://tbpl.mozilla.org/?tree=Try&rev=bf7acaa7ad89 .
Attachment #768148 -
Attachment is obsolete: true
Attachment #768281 -
Flags: review?(justin.lebar+bug)
Comment 13•11 years ago
|
||
Someone on #developers noticed that the try push for this patch used:
"try: -b do -p all -u all -t none"
This causes 160-180 hours of machine time to be used - when the patch was B2G only.
Please can you choose more specific platforms using http://trychooser.pub.build.mozilla.org/ when you don't need to test everywhere, to save using resources unnecessarily (which in the case of the EC2 run jobs, directly costs $$$ per push).
Thanks! :-)
Assignee | ||
Comment 14•11 years ago
|
||
(In reply to Ed Morley [:edmorley UTC+1] from comment #13)
> Someone on #developers noticed that the try push for this patch used:
> "try: -b do -p all -u all -t none"
>
> This causes 160-180 hours of machine time to be used - when the patch was
> B2G only.
I got it, thanks for reminding.
I didn't notice I am just doing this in GonkHal. I stopped the try push, thanks!
Comment 15•11 years ago
|
||
> Attachment #768281 [details] [diff] - Flags: review?(justin.lebar+bug@gmail.com)
I gave you r+ on the last patch, so you don't need to ask for another review.
If on the other hand there's something here that you'd like me to look at, I'm happy to; just let me know what it is.
Updated•11 years ago
|
Attachment #768281 -
Flags: review?(justin.lebar+bug)
Comment 16•11 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #6)
> We'd like to get the leo+ status for this bug because it can bring
> significant benefits:
>
> ...
>
> If we can get the patch through our regular tests the risk factor for taking
> this patch should be fairly low...
From Triage:
Since the TBPL tests didn't complete we don't have the low risk confirmation that Gabriele mentioned in comment #6 so the risk is too high for leo+. Nominating for koi and tracking.
Status: NEW → ASSIGNED
blocking-b2g: leo? → koi?
tracking-b2g18:
--- → ?
Keywords: perf
Whiteboard: [c= ]
Comment 17•11 years ago
|
||
Maybe we misunderstood each other.
Whether or not this patch passes tests by the time triage happens to look at the patch has nothing to do with how risky the patch is. That has to do with how loaded our infrastructure is. We're not making blocking decisions based on infra load, right?
I'm pretty sure Gabriele was saying in comment 6 that he thinks this patch will be safe, in the sense that we'd notice any regressions as part of our automated test suite. I don't think he was saying that you should not approve for blocking if triage happened to occur before the automated tests completed.
blocking-b2g: koi? → leo?
Comment 18•11 years ago
|
||
Allan, can you please push this to try again (for the relevant platforms) so release drivers can make an informed decision here?
Comment 19•11 years ago
|
||
Next time you would like completed test runs before making a triage decision, I would really appreciate it if you'd consider asking for those and waiting, instead of proactively marking the bug as blocking-minus. That would make us in engineering feel more like partners in this process.
Assignee | ||
Comment 20•11 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #19)
> Next time you would like completed test runs before making a triage
> decision, I would really appreciate it if you'd consider asking for those
> and waiting, instead of proactively marking the bug as blocking-minus. That
> would make us in engineering feel more like partners in this process.
Thanks, Justin. Next time I will mark triage decision after test runs completely and successfully.
I pushed this to try again, and I am waiting for result. I will mark "leo?" again once I got possitive result from tests.
https://tbpl.mozilla.org/?tree=Try&rev=ec9b9b6db603
blocking-b2g: leo? → ---
Assignee | ||
Comment 21•11 years ago
|
||
Attachment #768281 -
Attachment is obsolete: true
Assignee | ||
Comment 22•11 years ago
|
||
> I pushed this to try again, and I am waiting for result. I will mark "leo?"
> again once I got possitive result from tests.
>
> https://tbpl.mozilla.org/?tree=Try&rev=ec9b9b6db603
Try server tested this completely and we got positive result of this patch.
blocking-b2g: --- → leo?
Assignee | ||
Updated•11 years ago
|
Keywords: checkin-needed
Comment 23•11 years ago
|
||
Keywords: checkin-needed
Comment 24•11 years ago
|
||
Hi Alan & Justin,
Since we have a crash on modem, Bug 886217, we have tested a lot using marionette scripts. During the test, what we found is that when the system enters to OOM state because of memory leakage in b2g process, it starts to kill the OOP applications one by one. After this happens a number of times finally it kills Homescreen App and then tries to reopen Homescreen App again but it couldn't open it properly. And kill & reopen it again & again...
I think we need to take this patch for 1.1 to fix this unexpected behavior of LMK.
I will change the flag to leo+. Please consider this.
Updated•11 years ago
|
blocking-b2g: leo? → leo+
Comment 25•11 years ago
|
||
(In reply to Jinho Hwang [:jeffhwang] from comment #24)
> I think we need to take this patch for 1.1 to fix this unexpected behavior
> of LMK.
> I will change the flag to leo+. Please consider this.
I wouldn't expect this patch to fix the problem you describe.
If the phone is so low on memory that the homescreen immediately dies, then that's probably because the main process is using a lot of memory. That's a problem, but it does not have anything to do with this fix.
Comment 26•11 years ago
|
||
m-c may be closed for awhile, so I've gone ahead and pushed to b2g18 based on being green on birch.
https://hg.mozilla.org/releases/mozilla-b2g18/rev/561f78c43eb7
status-b2g18:
--- → fixed
status-b2g18-v1.0.0:
--- → wontfix
status-b2g18-v1.0.1:
--- → wontfix
status-b2g-v1.1hd:
--- → affected
status-firefox23:
--- → wontfix
status-firefox24:
--- → wontfix
status-firefox25:
--- → fixed
Target Milestone: --- → 1.1 QE4 (15jul)
Comment 27•11 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 28•11 years ago
|
||
Comment 29•11 years ago
|
||
Hi, Alan and all,
Thanks for your effort on this bug.
The patch works as we expected behavior.
The following is the detail information that QA collected.
-----------------------------------------------------------------
* Defect/Bug information:
While user launched over than 4 apps on Fx OS, the apps will only exist 2~3 apps.
The others apps are "randomly" killed by memory management process of kernel.
Please refer to the following video.
-> https://dc1.safesync.com/LMmDYKWv/VIDEO0008.mp4?a=kwBNaKoMF6Q
* Root cause:
In "lowmemorykiller.c", the wrong order of minfree caused this problem.
* Reproduce steps:
1. Launching 5 apps on Fx OS. (Ex: dialler -> Message -> Contact -> Browser -> FM radio -> Settings)
2. Switch to card view
@ Expected result: The card view shows apps preview page by "sequence"
(Settings -> FM radio -> Browser -> Contact -> Message -> dialler)
Maybe "Message" app and dialler app are killed by system by default behavior.
@ Actual result: The launched apps are randomly killed by system and only existing 2~3 apps on card view.
* Verified build:
+ Mozilla-b2g18-unagi-eng/2013-07-03-07-02-10
- HG - Gecko : "a078412b1671"
- HG - Gaia : "NA"
- Git - Gecko: "38e6b257a832fbd6c91fac54bf8a8c5d3572e3fb"
- Git - Gaia: "b12b6f8895427708fad4e25af180266085bd2e74"
+ Mozilla-b2g18_v1_1_0_hd-unagi/2013-07-03-07-02-14
- HG - Gecko : "6f874c20e68e"
- HG - Gaia : "NA"
- Git - Gecko: "NA"
- Git - Gaia: "b12b6f8895427708fad4e25af180266085bd2e74"
* Verified methods:
1. Command line:
william@tpemozilla:~/Mozilla-b2g18-unagi-eng/2013-07-03-07-02-10/$ adb shell cat /sys/module/lowmemorykiller/parameters/minfree
1024,1280,1536,1792,2048,5120 ->>(Pass, it shows in increasing order)
william@tpemozilla:~/Mozilla-b2g18-unagi-eng/2013-07-03-07-02-10/$ adb shell cat /sys/module/lowmemorykiller/parameters/adj
0,1,2,3,4,6 (Pass, it shows in increasing order)
2. Follow the reproduction steps I mentioned. You could click the following link to know the detail.
-> https://dc1.safesync.com/LMmDYKWv/VIDEO0009.mp4?a=fE0C7WFP6K0
Status: RESOLVED → VERIFIED
Updated•11 years ago
|
Whiteboard: [c= ] → [c= ][LeoVB+]
You need to log in
before you can comment on or make changes to this bug.
Description
•