Closed Bug 974308 Opened 10 years ago Closed 9 years ago

[tarako] Study memory thrashing and find out a proper way to handle it.

Categories

(Firefox OS Graveyard :: Performance, defect, P3)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:-, tracking-b2g:-, b2g-v1.3T affected)

RESOLVED INCOMPLETE
blocking-b2g -
tracking-b2g -
Tracking Status
b2g-v1.3T --- affected

People

(Reporter: sinker, Assigned: ting)

References

Details

(Keywords: perf, Whiteboard: [c=memory p= s= u=tarako])

Attachments

(9 files, 7 obsolete files)

154 bytes, text/html
Details
154 bytes, text/html
Details
154 bytes, text/html
Details
5.62 KB, patch
Details | Diff | Splinter Review
1.28 MB, application/gzip
Details
154 bytes, text/html
Details
154 bytes, text/html
Details
154 bytes, text/html
Details
6.17 KB, patch
Details | Diff | Splinter Review
This bug is to study memory thrashing of 128MB devices, and find out a way to improve the response time.

The tarako would be totally blocked for a few seconds.  It is some kind of memory thrashing.  The solution can be changing parameters of OOM killer or a customized thrashing aware killer.
Attached file http://goo.gl/fNPxny (obsolete) —
Modified vmstat a bit to gather the system information we're intrested. Will collect yaffs numbers as well later.
Attachment #8378186 - Attachment description: the spreadsheet of collected system information → http://goo.gl/fNPxny
Attachment #8378186 - Attachment mime type: text/plain → text/url
Attachment #8378186 - Attachment mime type: text/url → text/x-url
blocking-b2g: --- → 1.3T?
Attachment #8378186 - Attachment mime type: text/x-url → application/x-url
Attached file high_sys_load#1 (obsolete) —
Attachment #8378186 - Attachment is obsolete: true
Hi Ting, what the number is responsible for response time of the device?  I mean how do I know the device is blocked or not at a given period?
Could you also explain what these fields meaning?
Comment on attachment 8378206 [details]
high_sys_load#1

running     = /procs/stat procs_running
blocked     = /procs/stat procs_blocked
free        = /proc/meminfo MemFree
cached      = /proc/meminfo Cached
mapped      = /proc/meminfo Mapped
anonomous   = /proc/meminfo AnonPages
zram        = /sys/block/zram0/mem_used_total
major fault = /proc/vmstat pgmajfault
swap in     = /proc/vmstat pswpin
swap out    = /proc/vmstat pswpout
us          = /proc/stat cpu user
ni          = /proc/stat cpu nice
sy          = /proc/stat cpu system
id          = /proc/stat cpu idle
wa          = /proc/stat cpu iowait
ir          = /proc/stat cpu irq
si          = /proc/stat cpu softirq

For the attachment 8378206 [details], system loading is always around 99, which is not
responding. I will add yaffs numbers and iterate more times to understand the
status before entering high system loading.
Attachment #8378206 - Attachment description: spreadsheet_collected-stats → high_sys_load#1
Attachment #8378206 - Attachment filename: spreadsheet.html → high_sys_load#1.html
Attached file high_sys_load#2 (obsolete) —
Recorded another log, added two numbers:

  yaffs_n_page_reads  = summation of /proc/yaffs n_page_reads
  yaffs_n_page_writes = summation of /proc/yaffs n_page_writes

and dropped 2:

  cpu_irq
  cpu_softirq

System loading is showed as stepped area at right vertical axis. Be noted system
loading goes up when yaffs_n_page_reads increases.
Attached file high_system_load#1
Updated link to the chart.
Attachment #8378206 - Attachment is obsolete: true
Attached file high_system_load#2
Attachment #8378815 - Attachment is obsolete: true
Attached file high_system_load#3
Captured another log, the high load started from tapping message button in a contact. I then tried to edit message and pressed keyboard several times.
I am thinking:

- Seek chances to improve NAND read performance
  - Read ahead and utilize chip's feature to read with data cache
  - bug 944457, comment 211 shows reading smaller file requires 200~400us for
    reading a page (2KB), but from NAND chip's spec, it takes 25us maximum
  - This will be a system wide improvement if we can make any

- Avoid swapping by killing background application earlier
  - LMK actually is not triggered when the min_free level is reached. It is a
    slab shrinker, which is called after kernel has tried hard to reclaim pages
  - Possible options:
    - ulmkd, a user space low memory killer daemon
    - cgroup memory pressure notification, but it is introduced in kernel 3.10
    - Invokes LMK by ourselves whenever we are going to launch an application
  - Will we need some rules for killing, e.g., not to kill background music
    application?

- Others worth checking
  - Swap token, prohibits a process from page frame reclaiming for a period
(In reply to Ting-Yu Chou [:ting] from comment #10)
> - Avoid swapping by killing background application earlier
>   - LMK actually is not triggered when the min_free level is reached. It is a
>     slab shrinker, which is called after kernel has tried hard to reclaim pages

I was wrong, LMK is called quite frequently, but usually return earlier because 1) nr_to_scan == 0, or 2) NR_FILE_PAGES - NR_SHMEM - total_swapcache_pages is larger than maximum min_free.
So there are two ways to go in general:

1. Reduce the working set.
1.1 Kill the apps in comment 10.
1.2 We may as well suspend/sigstop the background apps. Compared to suspension, to kill an app releases it's dirty pages. Note that the benefit differs only by the compressed size provided that the overhead of zram is acceptable. Or to combine them; Killing is the last resort.
1.3 De-prioritize background apps so that they are unable to compete for precious frames. We may also make the message queue in chrome-process aware of process priority.

2. Improve IO
2.1 Improve the flash performance in comment 10.
2.2 Just in case, to make sure the filesystem is correctly aligned to the physical boundary that the flash controller can access efficiently.
2.3 A compressed filesystem to trade cpu times for less IO.
(In reply to Ting-Yu Chou [:ting] from comment #10)
> - cgroup memory pressure notification, but it is introduced in kernel 3.10

Just a note on this, we've experimented with cgroups in the past and even with kernels that supported it we run into very annoying bugs so we decided not to use it at the time. From what I could see the cgroups code in most of our kernels contains some Android-specific modifications compared to the Linux mainline kernel and those seem to have made it quite fragile.
(In reply to Gabriele Svelto [:gsvelto] from comment #13)
> Just a note on this, we've experimented with cgroups in the past and even
> with kernels that supported it we run into very annoying bugs so we decided
> not to use it at the time. From what I could see the cgroups code in most of
> our kernels contains some Android-specific modifications compared to the
> Linux mainline kernel and those seem to have made it quite fragile.

Gabriele, thank you for the information.
Attached patch kill-bg-apps.patch (obsolete) — Splinter Review
This wip patch implements a strategy that is as follows:
- when an app goes into the background, we add it to a list of killable app. The homescreen and apps that have the PERCEIVABLE_BACKGROUND priority don't make it in this list.
- when an app goes into the foreground, we try to kill all current background apps except the one that is moving to the foreground.

That doesn't seem too bad on a tarako. I've been playing a bit with launching apps while the music player is active in the background and while we still swap sometimes I did not experience long freezes.

Still to do: make that work properly with activities, ie. ensure we don't kill the activity caller and also maybe iac users (not sure if iac matters for 1.3 as the important user is the music app that we don't kill for other reasons).

The implementation is a quick hack, but if some of you could test it and also feel it's an improvement I'll clean it up (and add a pref to only use that on low memory devices).
Attached patch kill-bg-apps.patch (obsolete) — Splinter Review
I found that IsPreallocated() was still true on the first time we set an app the FOREGROUND priority because of the order of the test in MaybeTakePreallocatedAppProcess. This patch changes that, and that prevents killing the preallocated app which was sometimes happening with the previous patch.
Attachment #8380088 - Attachment is obsolete: true
(In reply to Fabrice Desré [:fabrice] from comment #16)
> The implementation is a quick hack, but if some of you could test it and
> also feel it's an improvement I'll clean it up (and add a pref to only use
> that on low memory devices).
Thank you Fabrice, I just passed this information to Gonk team, see if they can include the patch in today's build and have more people try it out. I will also try it today.
(In reply to Fabrice Desré [:fabrice] from comment #15)
> Created attachment 8380088 [details] [diff] [review]
> kill-bg-apps.patch
> 
> This wip patch implements a strategy that is as follows:
> - when an app goes into the background, we add it to a list of killable app.
> The homescreen and apps that have the PERCEIVABLE_BACKGROUND priority don't
> make it in this list.
> - when an app goes into the foreground, we try to kill all current
> background apps except the one that is moving to the foreground.
There are some concerns about "kill all current background apps except the one that is moving to the foreground."
We tested with playing music at background, then send SMS with attachment, with your patch, the music playback sometimes will be killed. It impacted user experience.
How about kill few apps with priority rather than kill all current background apps except one?
This patch should kill the music app (and was not for me) when it's playing music since it has then the BACKGROUND_PERCEIVABLE level and not BACKGROUND.
(In reply to Fabrice Desré [:fabrice] from comment #19)
> This patch should kill the music app (and was not for me) when it's playing

should *not* kill...
Hm, I think I know why you could hit that. I'll post a new patch tomorrow.
Attachment #8380375 - Attachment is obsolete: true
(In reply to Fabrice Desré [:fabrice] from comment #15)
> Still to do: make that work properly with activities, ie. ensure we don't
> kill the activity caller and also maybe iac users (not sure if iac matters
> for 1.3 as the important user is the music app that we don't kill for other
> reasons).

Currently that's not a problem because apps invoking inline activities are still considered in the foreground. Until we fix bug 892371 that's still going to be the case. (And yes, I've wanted to fix that bug for a long time but still haven't got enough spare time to do so.)
Thinker prefers to improve the code above gonk, which benefits not only Tarako but also other projects. So I'll leave improving NAND read as the lowest priority. Now he'd like to test followings to see if it prevents system from freezing:

a. decrease the priority of either b2g or foreground content process
b. suspend background applications

Short summary: Both does not help, no matter it's done before or when system freezes.

The steps I used to reproduce the system freeze with reference-workload-light:

1. play music in background
2. launch browser
3. launch message app, enter BIG-THREAD-MMS, scroll up/down rapidly, and press
   home key when the message is loading

For a), I renice either b2g or foreground content process to 19. For b), I suspended homescreen, music, and browser.
(In reply to Gabriele Svelto [:gsvelto] from comment #22)
> (In reply to Fabrice Desré [:fabrice] from comment #15)
> > Still to do: make that work properly with activities, ie. ensure we don't
> > kill the activity caller and also maybe iac users (not sure if iac matters
> > for 1.3 as the important user is the music app that we don't kill for other
> > reasons).
> 
> Currently that's not a problem because apps invoking inline activities are
> still considered in the foreground. Until we fix bug 892371 that's still
> going to be the case. (And yes, I've wanted to fix that bug for a long time
> but still haven't got enough spare time to do so.)

It's very much a problem for non-inline activities, and we need to support them. I'll take another look at 892371 today.
New patch - I realized I could just reuse the BackgroundProcessLRUPool to track background apps instead of re-implementing that. I also added a pref to easily turn that on/off since we won't need that on all devices.
Fabrice, last time you asked me would it be helpful if letting application load less things. Could you please point me out where is the code about this? Thank you.
Flags: needinfo?(fabrice)
(In reply to Ting-Yu Chou [:ting] from comment #26)
> Fabrice, last time you asked me would it be helpful if letting application
> load less things. Could you please point me out where is the code about
> this? Thank you.

I don't remember what this was about :(
I've seen very low i/o performance when loading resources from app packages (that happens in nsJARChannel.cpp). Maybe we trash more because we don't limit the number of simultaneous loads. That may be worthwhile to try having a limited pool of jar channels.
Flags: needinfo?(fabrice)
I added a log into yaffs_readpage_nolock() with serial number, filename, and offset. Then tried to reproduce the system freeze by:

  1. play music in background
  2. launch browser and back to home screen
  3. launch dialer and enter call log, scroll a bit and select a contact
  4. press message button at contact's page
  5. press "Message" to launch keyboard

Used shell command to capture dmesg:

  $ while true; do adb shell dmesg -c >> dmesg.out; sleep 1; done

The system freezed during 1190.0 ~ 1410.5, from serial number there were 212966 yaffs_readpage_nolock() invocations, 96868 of them were captured:

  63591 /system/b2g/libxul.so
   6623 /system/lib/libMali.so
   5088 /system/lib/libc.so
   4025 /system/bin/toolbox
   2102 /system/b2g/libmozglue.so
   2020 /system/b2g/libnss3.so
   1800 /system/bin/mksh
   1434 /system/lib/libcutils.so
   1379 /system/lib/libm.so
    767 /system/bin/linker
    685 /system/lib/libui.so
    549 /system/lib/libEGL.so
    460 /system/b2g/omni.ja
    434 /system/lib/libutils.so
    433 /system/b2g/b2g
    375 /system/lib/libstlport.so
    365 /data/local/webapps/keyboard.gaiamobile.org/application.zip
    323 /system/lib/hw/audio.primary.sc8810.so
    305 /data/local/storage/persistent/chrome/idb/3406066227csotncta.sqlite
    299 /system/lib/libusbhost.so
    269 /data/local/webapps/sms.gaiamobile.org/application.zip
    260 /system/lib/liblog.so
    236 /system/bin/slog
    201 /system/lib/libstdc++.so
    174 /system/lib/libUMP.so
    157 /system/lib/libreference-ril_sp.so
    150 /system/lib/libstagefright.so
    146 /system/lib/libbinder.so
    142 /system/lib/libmedia.so
    130 /system/fonts/FiraSansOT-Regular.otf
    124 /system/lib/hw/gralloc.sc8810.so
    123 /data/local/storage/persistent/chrome/idb/2588645841ssegtnti.sqlite
    117 /system/lib/hw/sensors.sc8810.so
    111 /system/bin/phoneserver_2sim
     98 /system/lib/libril_sp.so
     96 /system/lib/libwilhelm.so
     95 /data/local/storage/persistent/chrome/idb/226660312ssm.sqlite
     91 /system/lib/libaudioflinger.so
     86 /data/local/webapps/system.gaiamobile.org/application.zip
     75 /system/fonts/FiraSansOT-Light.otf
     75 /system/bin/eng_rf_nv_update
     72 /system/lib/libGLESv1_CM.so
     66 /system/bin/engpcclient
     57 /system/lib/libpixelflinger.so
     56 /system/b2g/libfreebl3.so
     54 /system/lib/libGLESv2.so
     54 /system/lib/egl/libGLESv2_mali.so
     47 /system/lib/egl/libEGL_mali.so
     40 /system/fonts/FiraSansOT-Medium.otf
     39 /system/bin/vold
     37 /system/lib/libstagefright_sprd_mp3dec.so
     36 /system/bin/netd
     32 /system/bin/vcharged
     32 /system/b2g/libsoftokn3.so
     27 /system/lib/libstagefright_omx.so
     27 /system/bin/b2g-info
     24 /system/lib/libstagefright_foundation.so
     21 /system/bin/rild_sp
     14 /system/lib/libsysutils.so
     14 /system/lib/hw/audio_policy.default.so
     14 /system/bin/rilproxy
     12 /system/lib/libhardware_legacy.so
      8 /data/local/tmp/mozilla-temp-337349264
      6 /system/lib/libmediaplayerservice.so
      6 /data/local/storage/persistent/8+f+app+++music.gaiamobile.org/idb/598554340MFe2d%icaiDsBu%m2F.sqlite
      5 /data/local/storage/persistent/chrome/idb/2588645841ssegtnti.sqlite-journal
      4 /system/bin/usbmond
      3 /system/bin/servicemanager
      2 /system/bin/b2g-ps
      2 /sys/system/b2g/libxul.so
      1 /system/bin/oom-msg-logger
      1 /system/b2g/plugin-container
      1 /data/local/tmp/etilqs_oX0DRIUiwhiDL1z
      1 /data/local/storage/persistent/chrome/idb/3406/system/b2g/libxul.so
(In reply to Ting-Yu Chou [:ting] from comment #28)
>   63591 /system/b2g/libxul.so

Check http://goo.gl/3zZt6L for the number of times yaffs_readpage_nolock() reads libxul.so for each offset. The distribution is diffuse.

BTW, I couldn't map offset 0x00085000 ~ 0x00229000 to any section of libxul.so, still not quite sure how to resolve them to symbols.
(In reply to Ting-Yu Chou [:ting] from comment #29)
> BTW, I couldn't map offset 0x00085000 ~ 0x00229000 to any section of
> libxul.so, still not quite sure how to resolve them to symbols.

It maps to text section, I mixed up the address and offset.
Thinker & Ting-Yuan are thinking to let linker reorder the correlated functions of libxul.so to have them stayed closer. And then I found PGO (bug 970175), I will try to enable it and see how does it go.
See Also: → 603370
I noticed nsIFrame.SchedulePaint() triggers RefreshDriver.ScheduleViewManagerFlush() at contact's profile page, even after pressing the message button and Messages app shows up. Not sure if it causes the freeze of comment 28.
(In reply to Ting-Yu Chou [:ting] from comment #32)
> I noticed nsIFrame.SchedulePaint() triggers
> RefreshDriver.ScheduleViewManagerFlush() at contact's profile page, even
> after pressing the message button and Messages app shows up. Not sure if it
> causes the freeze of comment 28.

Created bug 978712 for this.
Attachment #8380962 - Flags: review?(cyu)
triage: 1.3T+, this is needed for tarako to avoid sudden lockup of device
blocking-b2g: 1.3T? → 1.3T+
Modified gonk-misc/Android.mk to enable PGO:

  - $(MAKE) -C $(GECKO_PATH) -f client.mk -s && \
  + $(MAKE) -C $(GECKO_PATH) -f client.mk -s profiledbuild && \

but both arm-eabi-4.4.3 and arm-eabi-4.6 ran into:

  {standard input}: Assembler messages:
  {standard input}:112: Error: lo register required -- `cbz r9,2f'
  {standard input}:202: Error: lo register required -- `cbz r9,2f'
  make[7]: *** [/home/ting/w/fxos/tarako-v13t/objdir-gecko/security/nss/lib/freebl/mpi_arm.o] Error 1

Checked CFLAGS, there's only one difference from non-PGO build: "-fprofile-generage", is still trying to work around it.
Comment on attachment 8380962 [details] [diff] [review]
bg-apps-bye.patch

Review of attachment 8380962 [details] [diff] [review]:
-----------------------------------------------------------------

I remove the review flag because I don't think this is the route we should take, even we might finally decide to do this. If we set the OOM parameters and process priority correctly, we should able achieve the same or similar result. We should strive to fix the problem that background apps still do something (e.g. perform GC/CC) that causes swapping of the foreground app, exhausting the file cache and the like and cause such problem. Killing background apps should be used as the last resort.

::: b2g/app/b2g.js
@@ +620,1 @@
>  

Shouldn't we set it to false and only make it true on 1.3T instead?

::: dom/ipc/ProcessPriorityManager.cpp
@@ +1320,5 @@
> +  for (uint32_t i = 0; i < killed.Length(); i++) {
> +    ContentParent* parent = killed[i];
> +    LOG("About to force kill pid: %d", parent->Pid());
> +    RemoveFromBackgroundLRUPool(parent);
> +    unused << kill(parent->Pid(), SIGKILL);

We shouldn't kill() the process with SIGKILL in high-level code like ProcessPriorityManager. Use parent->KillHard() if we really need to.
Attachment #8380962 - Flags: review?(cyu)
I don't disagree that we need a better solution Cervantes, but we have a deadline in less than 2 weeks so it may be time to cut corners to get a stable build we can test. I'll post a new patch fixing your comment.
This patch doesn't handle activities because both the caller and the activity have foreground priority. If we still have music playing in background, there is no one to kill from the background LRU pool. In this case we can still have system freeze. We need to think what we can do for it:

1. kill the background music
2. kill the activity
3. kill the activity caller

If we all agree that long system freeze (e.g. more than 5 seconds) is the worst and needs to be avoided, then we need to kill someone. For such case we might as well extend the background LRU pool and kill the one with lowest priority and least used.
Determine system status per attachment 8378843 [details], and attachment 8378844 [details]:

  - going freezed: both pswpin & pswpout grows over 800 for 2 seconds
  - freezed: cpu system usage over 80 for 3 seconds
(In reply to Ting-Yu Chou [:ting] from comment #35)
>   {standard input}: Assembler messages:
>   {standard input}:112: Error: lo register required -- `cbz r9,2f'
>   {standard input}:202: Error: lo register required -- `cbz r9,2f'

The system image for generating profile comes out after following modifications, but it is too big to download to Tarako (189.1 MB).

diff --git a/security/nss/lib/freebl/Makefile b/security/nss/lib/freebl/Makefile
index 0d293f1..d30a910 100644
--- a/security/nss/lib/freebl/Makefile
+++ b/security/nss/lib/freebl/Makefile
@@ -201,7 +201,7 @@ ifeq ($(CPU_ARCH),x86)
     #ECL_USE_FP = 1
 endif
 ifeq ($(CPU_ARCH),arm)
-    DEFINES += -DMP_ASSEMBLY_MULTIPLY -DMP_ASSEMBLY_SQUARE 
+#    DEFINES += -DMP_ASSEMBLY_MULTIPLY -DMP_ASSEMBLY_SQUARE 
     DEFINES += -DMP_USE_UINT_DIGIT
     DEFINES += -DSHA_NO_LONG_LONG # avoid 64-bit arithmetic in SHA512
     MPI_SRCS += mpi_arm.c
diff --git a/client.mk b/client.mk
index 561a294..9dfe936 100644
--- a/client.mk
+++ b/client.mk
@@ -243,10 +243,10 @@ endif
 profiledbuild::
        $(MAKE) -f $(TOPSRCDIR)/client.mk realbuild MOZ_PROFILE_GENERATE=1 MOZ_P
        $(MAKE) -C $(PGO_OBJDIR) package MOZ_PGO_INSTRUMENTED=1 MOZ_INTERNAL_SIG
-       rm -f ${PGO_OBJDIR}/jarlog/en-US.log
-       MOZ_PGO_INSTRUMENTED=1 JARLOG_FILE=jarlog/en-US.log EXTRA_TEST_ARGS=10 $
-       $(MAKE) -f $(TOPSRCDIR)/client.mk maybe_clobber_profiledbuild
-       $(MAKE) -f $(TOPSRCDIR)/client.mk realbuild MOZ_PROFILE_USE=1
+#      rm -f ${PGO_OBJDIR}/jarlog/en-US.log
+#      MOZ_PGO_INSTRUMENTED=1 JARLOG_FILE=jarlog/en-US.log EXTRA_TEST_ARGS=10 $
+#      $(MAKE) -f $(TOPSRCDIR)/client.mk maybe_clobber_profiledbuild
+#      $(MAKE) -f $(TOPSRCDIR)/client.mk realbuild MOZ_PROFILE_USE=1
 
 #####################################################
 # Build date unification
Tried to enlarge min_free for LMK, which Ting-Yuan thinks it'd be helpful if we can keep ~20MB free+cache most of time. Meanwhile, Thinker prefers to keep as many background apps as possible (still with good user experience), so the test was done without Fabrice's patch (attachment 8380962 [details] [diff] [review]).

The test results:

  http://goo.gl/gh1JUS

  - Killing Homescreen doesn't feel all right, since I can see nothing but
    wallpaper for 1~2 seconds when back to home screen.
  - Smaller min_free earns us just one extra background app, which I prefer
    smoothness than that.
When you run this kind of tests, can you also add app launch time (just turn on the time to load option in the Settings app) ?
I wrote this (admittedly aggressive) background app kill patch because I noticed that even if we kept apps around, it was in many cases slower to move them back into the foreground than to start them from scratch. This means that the overall user experience was actually way better, and this is the goal here.
I (In reply to Fabrice Desré [:fabrice] from comment #42)
> When you run this kind of tests, can you also add app launch time (just turn
> on the time to load option in the Settings app) ?

I see.

> I wrote this (admittedly aggressive) background app kill patch because I
> noticed that even if we kept apps around, it was in many cases slower to
> move them back into the foreground than to start them from scratch. This
> means that the overall user experience was actually way better, and this is
> the goal here.

Yes, this is also my understanding. Just Thinker'd like to know how many background apps can be there with larger min_free, so I need to test it without your patch.
See Also: → 978712
See Also: → 981278
Ting, since you have been working on this, assign to you for now. please reassign if you are not working on this. thanks
Assignee: nobody → tchou
There were discussions about killing application:

  a) Extend background LRU to kill not just background app
  b) Polling conditions may cause thrashing, and kill beforehand
  c) Tuning LMK, probably more than one configuration for different situations
Hardly run into system freeze with this configuration today:

  adb shell "echo 256,512,1024,4608,4608,6144 > /sys/module/lowmemorykiller/parameters/minfree"

  oom_adj min_free
        0  1024 KB
        1  2048 KB
        2  4096 KB
        6 18432 KB
        8 18432 KB
       10 24576 KB

The cases I tested were:

  a) Start message activity from a contact's details page while playing music,
  b) Launch Gallery with 50 images while playing music, and
  c) Launch Marketplace while playing music

The drawback is Homescreen app does not recover fast enough, which I can see only wallpaper when back to home screen.

The tests were done without Fabrice's patch attachment 8380962 [details] [diff] [review], if the patch is included, min_free for oom_adj 10 can be a small number such as 1024 KB to let LMK never select a victim from oom_adj 10 (which there won't be anyone with Fabrice's patch).
Ting-Yu, what were the app startup times with your latests parameters? (if you can test both without and with my patch that would be awesome).
(In reply to Fabrice Desré [:fabrice] from comment #47)
> Ting-Yu, what were the app startup times with your latests parameters? (if
> you can test both without and with my patch that would be awesome).

Sure, then do you have any particular scenarios/applications would like to test?
(In reply to Ting-Yu Chou [:ting] from comment #48)
> (In reply to Fabrice Desré [:fabrice] from comment #47)
> > Ting-Yu, what were the app startup times with your latests parameters? (if
> > you can test both without and with my patch that would be awesome).
> 
> Sure, then do you have any particular scenarios/applications would like to
> test?

Not really. Whatever is covered by QA smoketests should be good.
Moving this to Runtime to get it out of the General component. This should probably go into a new "Performance" component.
Component: General → Runtime
Attached file time_to_load (obsolete) —
Fabrice, the smoketests are quite simple, so I tried to get loading time with different scenarios.
Attached file time_to_load
Fixed attachment content-type.
Attachment #8389727 - Attachment is obsolete: true
I am working on an active killer similar to fabrice's background killer. It's less aggressive and it can also kill non-background apps. This should act as our last resort and complement LMK parameter tuning.
Component: Runtime → Performance
We just had a team discussion, the conclusion is:

  - Will try to detect thrashing and kill application accordingly, which is a
    backup as we assume LMK can not handle all the cases, such as bug 973596.
    Currently there're two proposals for the detection: a) timer skew, and b)
    iowait.

  - The patch of bug 972130 and LMK min_free changes of bug 974308, comment 46
    will be included from next daily build. Will test it see if we still meet
    system freeze.

I am assigned to check whether (a) and (b) can be used as an indicator for bug
973596.
From the test results, neither timer skew nor iowait are suitable indicator for the case of bug 973596. Better confirm is it thrashing at first.
What can I do on my side?
Status: NEW → ASSIGNED
Priority: -- → P1
Whiteboard: [c=memory p= s= u=tarako]
(In reply to Ting-Yu Chou [:ting] from comment #55)
> Created attachment 8392767 [details]
> incoming_call_timer_skew_&_iowait
> 
> From the test results, neither timer skew nor iowait are suitable indicator
> for the case of bug 973596. Better confirm is it thrashing at first.

How about page fault/context switch/user space CPU time?
Monitor system status for incoming call, also Gallery launching for comparison.
NI on :joe to help follow-up on next steps here as discussed.
Flags: needinfo?(jcheng)
James, could you please merge the patch http://goo.gl/SjBj7r which enlarges the min_free of oom_adj 6, 8, and 10?
Flags: needinfo?(james.zhang)
triage: minus for now
if we have concrete plans on how to move forward with this bug, please renom for 1.3T? thanks
blocking-b2g: 1.3T+ → -
Flags: needinfo?(jcheng)
Done.


commit b8ee760324a8b9d714fa761bed2b3ad823f8e9a5
Author: james.zhang <james.zhang@spreadtrum.com>
Date:   Tue Mar 25 21:53:03 2014 +0800

    Bug#269156 merge patch from mozilla, Bug 974308 - [tarako] Study memory thrashing and find out a proper way to handle it.
    
    [bug number  ]
    [root cause  ]
    [changes     ]
    [side effects]
    [self test   ] mozilla test
    [reviewers   ]
    
    Change-Id: I5f4673c7c27d88d8669eb3f497af6fd05472752b
Flags: needinfo?(james.zhang)
After killing background applications, there're still chances run into thrashing when multiple web activities are requested. In that case, is there anything else we can do except killing foreground application?
music app is very easy to be killed by:
pref("hal.processPriorityManager.gonk.BACKGROUND.KillUnderKB", 24576);
Depends on: 988164
(In reply to Ting-Yu Chou [:ting] from comment #63)
> After killing background applications, there're still chances run into
> thrashing when multiple web activities are requested. In that case, is there
> anything else we can do except killing foreground application?

The detection of thrashing is a complementary of OOM.  It fixes the problem of OOM killer killing too late for some situation, and inflexible.
Tested multiple web activites Contacts -> SMS -> Gallery with bug 982491 applied, it was thrashing while picking image, note there's no background app.

Is it possible not to keep all the applications on the activity chain in foreground? For instance:

  1. Contacts -> SMS
  2. Contacts moves to background and may be killed
  3. SMS -> Gallery
  4. SMS moves to background and may be killed
  5. An image is picked, post result to SMS and launch it if it has been killed
  6. SMS done, launch Contacts if it has been killed

# There were two processes Communications and Gallery, seems Gallery is not grouped.
(In reply to Ting-Yu Chou [:ting] from comment #66)
> Is it possible not to keep all the applications on the activity chain in
> foreground? For instance:
> 
>   1. Contacts -> SMS
>   2. Contacts moves to background and may be killed
>   3. SMS -> Gallery

We could have this if we fix bug 892371.

>   4. SMS moves to background and may be killed
>   5. An image is picked, post result to SMS and launch it if it has been
> killed

Re-starting SMS after it has been killed is impossible as there's no way to save and restore the last state the app was in. What we can hope is that sending the contacts app to the background is enough to have the SMS app and gallery alive during the activity.
I will come out a daemon detecting thrashing for the reason Thinker commented at comment 65, but still I am not sure how should it handle the case when there're only foreground applications. Killing seems not a good idea, user may never finish her task.
(In reply to Gabriele Svelto [:gsvelto] from comment #67)

> Re-starting SMS after it has been killed is impossible as there's no way to
> save and restore the last state the app was in. What we can hope is that
> sending the contacts app to the background is enough to have the SMS app and
> gallery alive during the activity.

I have raised a discussion of session restoring at dev-b2g mailing list.  I am not sure how far it is from implementation, but it could solve the problem of restoring state of the app.
Attached patch thrashdog.patch (obsolete) — Splinter Review
Created a thrashing detector based on pswpin/pswpout of /proc/vmstat, it determines thrashing if the system is swapping in/out more than 2MB each for 3 seconds continuously (per experimental data). For now it kills only background application if thrashing is detected, which does not help for the case of web activities chain.

It selects victim the same as LMK, maximum RSS and swap consumption with largest oom_adj.
(In reply to Gabriele Svelto [:gsvelto] from comment #67)
> Re-starting SMS after it has been killed is impossible as there's no way to
> save and restore the last state the app was in. What we can hope is that
> sending the contacts app to the background is enough to have the SMS app and
> gallery alive during the activity.

As web activities can be chained differently, guess it'd be difficult to handle by case.

Note there're different test cases for chaining web activities (see bug 972669, comment 32), I am  worried how to pass them.
Attached patch thrashdog.patchSplinter Review
Some minor updates and print not only selected process id but also its name before killing.
Attachment #8399295 - Attachment is obsolete: true
I'm intend to modify watermark of memory zone in kernel, to advance the wakeup of kswapd and increase the work load of it.
So the kswapd would start to work after free memory was lower than WMARK_LOW,and would not stop until free memroy was higher than WMARK_HIGH.

I think it may reduce the thrashing, But I don't know how to test it.


ffos@ffos2:~/yingxu/6821/kernel$ git diff HEAD~1
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2500ec4..9f92ddc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5184,8 +5184,8 @@ void setup_per_zone_wmarks(void)
                        zone->watermark[WMARK_MIN] = tmp;
                }
 
-               zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp >> 2);
-               zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp >> 1);
+               zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + (tmp);
+               zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + (tmp*2);
                setup_zone_migrate_reserve(zone);
                spin_unlock_irqrestore(&zone->lock, flags);
        }
current LMK setting: music app in background can survive longer till 8172 KB.

 notify_trigger 14336 KB

  oom_adj min_free
        0  1024 KB
        1  2048 KB
        2  4096 KB
        6  8172 KB
        8 16384 KB
       10 18432 KB

commit 3945e13d8165bc6e0a5d865c6f12ce63954d910c
Author: ying.xu <ying.xu@spreadtrum.com>
Date:   Sat Apr 5 13:31:09 2014 +0800

    Bug#245770 change the LMK parameters
    
    [self test   ] kill process OK
    
    Change-Id: I04043c0911463652bd927be028eafe28a5cd243c
Priority: P1 → P3
tracking-b2g: --- → -
I don't think we'll ever touch a Tarako again so I'm closing this. Please reopen if you think this is still somehow relevant.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: