Closed Bug 948648 Opened 8 years ago Closed 4 years ago

Profiling for post-fork copy-on-write when using Nuwa

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jld, Unassigned)

References

(Depends on 1 open bug)

Details

(Whiteboard: [MemShrink:P2])

Attachments

(5 files)

In a child process forked from Nuwa, whenever we write to a page that was made copy-on-write by the fork, we have effectively allocated 4 KiB of system memory — and, if this happens in all children, the corresponding page in the Nuwa process is unique to it (i.e., part of the constant overhead that the COW sharing has to pay off).

If we had more data on these events, we might discover that we're unnecessarily sharing pages between written-after-fork and non-written-after-fork data, or that we have data we always overwrite post-fork that we could simply delay computing until then, or other memory wins along those lines.

This could be done with the perf toolchain and kernel modifications to add a software event for this (but see also the previous difficulties encountered in doing anything useful with perf on B2G), or presumably with dtrace or systemtap or any of the other general instrumentation tools (if we had support for them).

It could also in theory be accomplished in userland by changing memory protections during Nuwa freeze and using a carefully written SIGSEGV handler to collect the data, but that may be more difficult to get right than the alternatives.
Whiteboard: [MemShrink] → [MemShrink:P1]
Assignee: nobody → jld
Bug 952393 is related to this bug.  It could be done in an easier way.
Another approach would be to enable about:memory to report data that comes from unshared memory vs. shared memory.
There might be some things where the memory-saving efforts here and the long-term plans for multi-content-process desktop Firefox might overlap. Bill mentions some things on caches in the "How much memory will it use?" topic in http://billmccloskey.wordpress.com/2013/12/05/multiprocess-firefox/ - might such an approach give us something useful on FxOS as well?
We found that COW may increase the launch time of apps, because it introduces page faults for copying pages.  Copying memory is fast, but the overhead of a page fault is high.  We will study this impact further.
This is the b2g-procrank result of the latest build:

APPLICATION        PID      Vss      Rss      Pss      Uss  cmdline
b2g              12642   77732K   67736K   54392K   49528K  /system/b2g/b2g
Clock            13140   55992K   29312K   14732K   11776K  /system/b2g/plugin-container
Homescreen       12915   28544K   28544K   13723K   10568K  /system/b2g/plugin-container
Usage            12850   28392K   25624K   11146K    8256K  /system/b2g/plugin-container
Built-in Keyboa  13036   23068K   23068K    9200K    6544K  /system/b2g/plugin-container
(Nuwa)           12734   21084K   21084K    8336K    3976K  /system/b2g/plugin-container
(Preallocated a  31065   19260K   19260K    6869K    4556K  /system/b2g/plugin-container

I ran some tests on unagi and found that copy-on-write of 1000 pages takes 16 msec there, i.e. about 16 usec per page. The preallocated process forked from Nuwa has about 4.5 MB USS. The keyboard app has 6.5 MB. Launching an app from the preallocated process should therefore COW no more than 2 MB of memory (512 pages of 4 KiB), which should take about 8 msec. This doesn't fully explain the 40 msec slowdown of the Clock app launch.
Attached file COW test program
The test program for COW time measurement.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #3)
> There might be some things where the memory-saving efforts here and the
> long-term plans for multi-content-process desktop Firefox might overlap.
> Bill mentions some things on caches in the "How much memory will it use?"
> topic in http://billmccloskey.wordpress.com/2013/12/05/multiprocess-firefox/
> - might such an approach gives us something useful on FxOS as well?

Nuwa is quite b2g specific, so this will have very limited impact on desktop firefox.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #7)
> Nuwa is quite b2g specific, so this will have very limited impact on desktop
> firefox.

I know NuWa is FxOS-specific. I also was asking about memory sharing efforts that help on multi-content-process desktop potentially helping post-fork behavior on FxOS. But this all may be irrelevant here, not sure.
(In reply to Cervantes Yu from comment #5)
> process should COW no more than 2 MB memory, which should be done in about 8
I wouldn't say that yet.  How many private COW copies get made varies from app to app.  The keyboard's USS is the smallest in the example, but that doesn't mean it can serve as an upper bound on the private copies made.  I have an idea for checking how many pages are copied after loading an app.

 1. Attach gdb to the Nuwa process and to the content process of the observed app, so that they cannot be killed by any signal or IPC message.
 2. Kill all b2g processes except the Nuwa process and the observed content process.
 3. Compare the USS of the observed content process before and after killing the Nuwa process. The difference is exactly the number of pages shared between Nuwa and the observed content process (they move from the PSS of both processes to the USS of the observed process).
 4. Do the same experiment for Nuwa and the preallocated process.
 5. From the two results we get the number of pages copied since the app was loaded into a preallocated process.
Some test results w.r.t. comment #9:

Test steps:
- Launch b2g normally.
- Attach the b2g process, the Nuwa process and the preallocated process.
- Kill processes except the attached 3.
- Get memory usage with b2g-procrank.
- Kill the Nuwa process.
- Get memory usage again with b2g-procrank.

The USS of the preallocated process increases by 4924 KB.

Test with the Clock app:
- Launch b2g normally and launch the Clock app.
- Attach the b2g process, the Nuwa process and Clock app.
- Kill processes except the attached 3.
- Get memory usage with b2g-procrank.
- Kill the Nuwa process
- Get memory usage again with b2g-procrank.

The USS of the Clock app increases by 4748 KB.

From these tests, we can estimate that about 176 KB of pages (4924 KB - 4748 KB) are copied during app launch. It looks like app launch doesn't cause many pages to be copied. We should pay more attention to pages copied after fork() and before app launch.
(In reply to Cervantes Yu from comment #6)
> Created attachment 8355515 [details]
> COW test program
> 
> The test program for COW time measurement.

FYI: I tried modifying the program to access the pages non-sequentially, because I was wondering if locality could explain some of the difference, but it didn't change much: 173 vs. 166 ms for 10000 pages.
We should consider the effects of cache drops.  A page fault can mean dropping some old pages in order to load new ones, so pages that belong to the future working set may be dropped.  Those drops could introduce more I/O and slow things down.
Patch is based on the Geeksphone Keon kernel (3.0.8+) source; it won't apply to master without changes, but it's simple enough that adapting it should be easy.
Same as attachment 8413976 [details] [diff] [review] from bug 990790.  Not strictly necessary, but we'll lose some stacks without it.
With this patch, if $MOZ_NUWA_COW_PROFILE is set, any thread in a Nuwa child will be sent SIGPROF whenever it causes a page to be allocated for copy-on-write (and SIGPROF is set to be ignored if not already caught, so that the process doesn't immediately die.)
And this patch causes $MOZ_NUWA_COW_PROFILE to start the profiler after the Nuwa fork is finished (on the child side) but skip the usual signal sending.  This creates a profile of COW faults for each thread registered with the profiler.

This won't catch COWs prior to the profiler's SIGPROF handler being installed, or on non-profiler-registered threads, and its handling of COW faults caused by the signal handler itself could be improved (the signal is blocked while the handler runs, so it will become pending and the handler will be re-invoked after it returns — but it's not a realtime signal, so multiple instances won't be queued in this way).

The current number of COW faults for a thread can be determined by read()ing the perf_event fd, which is available to the signal handler in the si_fd member of the siginfo, but this feature isn't used yet.  And, of course, it's possible to mmap the fd and receive COW fault information asynchronously.
An example: 
https://people.mozilla.org/~bgirard/cleopatra/#report=508a6291a785e1ba448a4d187e07806a34804d8d

I booted the phone, unlocked the homescreen, and swiped through the pages.
I've discovered a few problems with the Gecko patch I've posted; I have fixes for them, except one: Binder's thread creation apparently bypasses mozglue (!), so those threads aren't sampled.  (They already weren't sampled because they're not registered with the profiler, but I could at least get a count of samples so missed, if any.)  It's not clear that this is significant, however.
We can safely skip the binder threads. I tried skipping the start of the binder thread pool in the Nuwa process, and the USS dropped by only 64 KB. I think that's because most content processes don't use binder.
Flags: needinfo?(khuey)
How do I read the profile in comment 17?

Can you post your updated Gecko patches?
Flags: needinfo?(khuey) → needinfo?(jld)
Clearing needinfo, as I won't have time to work on this in the immediate future; checked with khuey on IRC.
Flags: needinfo?(jld)
Whiteboard: [MemShrink:P1] → [MemShrink:P2]
I won't be working on this.  I'll let someone else decide whether to leave it open or WONTFIX.
Assignee: jld → nobody
Nuwa is gone after bug 1284674.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX