Closed Bug 948648 Opened 11 years ago Closed 8 years ago

Profiling for post-fork copy-on-write when using Nuwa

Tracking

(Not tracked)

Status:

RESOLVED WONTFIX

People

(Reporter: jld, Unassigned)

References

(Depends on 1 open bug)

Details

(Whiteboard: [MemShrink:P2])

Attachments

(5 files)

COW test program, 11 years ago, Cervantes Yu [:cyu] [:cervantes], 1.40 KB, text/x-c++src
Linux kernel: add PERF_COUNT_SW_COW_FAULTS. 11 years ago, Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧, 1.65 KB, patch
Bionic: add ARM unwind info to memset/memcpy. 11 years ago, Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧, 2.29 KB, patch
Gecko, part 1: Deliver SIGPROF for each post-Nuwa COW. 11 years ago, Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧, 8.41 KB, patch
Gecko, part 2: Start the profiler and collect the COW signals. 11 years ago, Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧, 10.66 KB, patch

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Description

•

11 years ago

In a child process forked from Nuwa, whenever we write to a page that was made copy-on-write by the fork, we have effectively allocated 4 KiB of system memory — and, if this happens in all children, the corresponding page in the Nuwa process is unique to it (i.e., part of the constant overhead that the COW sharing has to pay off). If we had more data on these events, we might discover that we're unnecessarily sharing pages between written-after-fork and non-written-after-fork data, or that we have data we always overwrite post-fork that we could simply delay computing until then, or other memory wins along those lines. This could be done with the perf toolchain and kernel modifications to add a software event for this (but see also the previous difficulties encountered in doing anything useful with perf on B2G), or presumably with dtrace or systemtap or any of the other general instrumentation tools (if we had support for them). It could also in theory be accomplished in userland by changing memory protections during Nuwa freeze and using a carefully written SIGSEGV handler to collect the data, but that may be more difficult to get right than the alternatives.
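For readers who want to see the basic effect without any of the kernel or profiler work proposed here, the sketch below counts minor page faults in a forked child while it writes to pages it inherited from its parent. It is an illustration only, in Python rather than Gecko code: `count_child_cow_faults` is a hypothetical helper, and `ru_minflt` counts every minor fault in the child (including ones caused by the interpreter itself), so it yields a rough count and no stacks.

```python
import mmap
import os
import resource
import struct

PAGE = mmap.PAGESIZE


def count_child_cow_faults(num_pages):
    """Return (minor faults seen by the child, parent data unchanged?)."""
    buf = mmap.mmap(-1, num_pages * PAGE,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    # Populate every page before forking, so the child inherits real pages.
    for i in range(num_pages):
        buf[i * PAGE] = 1

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: each first write to an inherited page should trigger a COW fault.
        status = 1
        try:
            before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
            for i in range(num_pages):
                buf[i * PAGE] = 2
            after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
            os.write(write_fd, struct.pack("q", after - before))
            status = 0
        finally:
            os._exit(status)

    # Parent: collect the child's count and check our own copy is untouched.
    os.close(write_fd)
    faults = struct.unpack("q", os.read(read_fd, 8))[0]
    os.close(read_fd)
    os.waitpid(pid, 0)
    unchanged = all(buf[i * PAGE] == 1 for i in range(num_pages))
    buf.close()
    return faults, unchanged


if __name__ == "__main__":
    print(count_child_cow_faults(64))
```

The parent's bytes stay at 1 because the child's writes went to private copies; those copies are the per-child memory cost this bug is about.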

Nicholas Nethercote [inactive]

Updated

•

11 years ago

Whiteboard: [MemShrink] → [MemShrink:P1]

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Updated

•

11 years ago

Assignee: nobody → jld

Thinker Li [:sinker]

Comment 1

•

11 years ago

Bug 952393 is related to this bug. It could be done in an easier way.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 2

•

11 years ago

Another approach would be to enable about:memory to report data that comes from unshared memory vs. shared memory.

Robert Kaiser

Comment 3

•

11 years ago

There might be some things where the memory-saving efforts here and the long-term plans for multi-content-process desktop Firefox might overlap. Bill mentions some things on caches in the "How much memory will it use?" topic in http://billmccloskey.wordpress.com/2013/12/05/multiprocess-firefox/ - might such an approach give us something useful on FxOS as well?

Thinker Li [:sinker]

Comment 4

•

11 years ago

We have found that COW may increase app launch time, because it introduces page faults in order to copy pages. Copying memory is fast, but the overhead of a page fault is high. We will study this impact further.

Cervantes Yu [:cyu] [:cervantes]

Comment 5

•

11 years ago

This is the b2g-procrank result of the latest build:

APPLICATION        PID      Vss      Rss      Pss      Uss  cmdline
b2g              12642   77732K   67736K   54392K   49528K  /system/b2g/b2g
Clock            13140   55992K   29312K   14732K   11776K  /system/b2g/plugin-container
Homescreen       12915   28544K   28544K   13723K   10568K  /system/b2g/plugin-container
Usage            12850   28392K   25624K   11146K    8256K  /system/b2g/plugin-container
Built-in Keyboa  13036   23068K   23068K    9200K    6544K  /system/b2g/plugin-container
(Nuwa)           12734   21084K   21084K    8336K    3976K  /system/b2g/plugin-container
(Preallocated a  31065   19260K   19260K    6869K    4556K  /system/b2g/plugin-container

I ran some tests on unagi and found that copy-on-write of 1000 pages takes 16 msec there. The preallocated process forked from Nuwa has about 4.5 MB USS. The keyboard app has 6.5 MB. Launching an app from the preallocated process should COW no more than 2 MB of memory, which should take about 8 msec. This doesn't fully explain the 40 msec slowdown of the Clock app launch.
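The estimate above can be written out as a few lines of arithmetic. This is only a restatement of the numbers in this comment, assuming 4 KiB pages (as in the bug description) and a cost that scales linearly with page count; the helper name `cow_time_ms` is made up for illustration.

```python
PAGE_KIB = 4                 # assumed page size, per the bug description
MS_PER_PAGE = 16 / 1000      # measured on unagi: 1000 pages in 16 msec


def cow_time_ms(kib):
    """Estimated time to copy-on-write `kib` KiB, assuming linear cost per page."""
    return (kib / PAGE_KIB) * MS_PER_PAGE


# USS gap between the keyboard app and the preallocated process (table above).
uss_gap_kib = 6544 - 4556

print(f"rounded 2 MB bound: {cow_time_ms(2048):.1f} ms")
print(f"exact USS gap ({uss_gap_kib} KiB): {cow_time_ms(uss_gap_kib):.1f} ms")
print(f"share of the 40 ms slowdown: {cow_time_ms(2048) / 40:.0%}")
```

Either way the copying accounts for roughly 8 msec, about a fifth of the observed 40 msec.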

Cervantes Yu [:cyu] [:cervantes]

Comment 6

•

11 years ago

Attached file: COW test program

The test program for COW time measurement.
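The attachment is not reproduced in this thread. As a rough idea of what such a measurement involves, here is a minimal sketch under stated assumptions; it is not the attached program. It populates an anonymous private mapping, forks, and times how long the child takes to write one byte to each inherited page. `time_cow_writes` is a hypothetical name, and because this is Python the interpreter's loop overhead is included, so the absolute numbers are not comparable to a C++ measurement.

```python
import mmap
import os
import struct
import time

PAGE = mmap.PAGESIZE


def time_cow_writes(num_pages):
    """Seconds a forked child takes to write one byte to each inherited page."""
    buf = mmap.mmap(-1, num_pages * PAGE,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    # Touch every page in the parent so the child shares real, populated pages.
    for i in range(num_pages):
        buf[i * PAGE] = 1

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            start = time.perf_counter()
            for i in range(num_pages):
                buf[i * PAGE] = 2      # first write to each page forces a copy
            elapsed = time.perf_counter() - start
            os.write(write_fd, struct.pack("d", elapsed))
            status = 0
        finally:
            os._exit(status)

    os.close(write_fd)
    elapsed = struct.unpack("d", os.read(read_fd, 8))[0]
    os.close(read_fd)
    os.waitpid(pid, 0)
    buf.close()
    return elapsed


if __name__ == "__main__":
    pages = 1000
    print(f"{pages} pages: {time_cow_writes(pages) * 1000:.2f} ms")
```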

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 7

•

11 years ago

(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #3)
> There might be some things where the memory-saving efforts here and the
> long-term plans for multi-content-process desktop Firefox might overlap.
> Bill mentions some things on caches in the "How much memory will it use?"
> topic in http://billmccloskey.wordpress.com/2013/12/05/multiprocess-firefox/
> - might such an approach give us something useful on FxOS as well?

Nuwa is quite b2g specific, so this will have very limited impact on desktop Firefox.

Robert Kaiser

Comment 8

•

11 years ago

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #7)
> Nuwa is quite b2g specific, so this will have very limited impact on desktop
> firefox.

I know Nuwa is FxOS-specific. I was also asking whether memory-sharing efforts that help multi-content-process desktop could help post-fork behavior on FxOS as well. But this may all be irrelevant here; I'm not sure.

Thinker Li [:sinker]

Comment 9

•

11 years ago

(In reply to Cervantes Yu from comment #5)
> process should COW no more than 2 MB memory, which should be done in about 8

I wouldn't say that yet. The amount of memory that ends up as private COW copies varies from app to app. The keyboard's USS is the smallest in the example, but that doesn't mean it is an upper bound on the private copies made.

I have an idea for checking how many pages are copied after loading an app.

1. By attaching gdb to the Nuwa process and the content process of the observed app, we prevent them from being killed by any signal or IPC message. So,
2. we can kill all b2g processes except the Nuwa process and the observed content process.
3. Then, by comparing the USS of the observed content process before and after killing the Nuwa process, we get exactly the number of pages shared between Nuwa and the observed content process. (They move from the PSS of both processes to the USS of the observed process.)
4. Do the same experiment for Nuwa and the preallocated process; then
5. we get the number of pages copied when loading an app into a preallocated process.
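The USS comparison in step 3 can also be scripted rather than read off b2g-procrank. The sketch below is one possible way to do it, assuming a Linux `/proc/<pid>/smaps` is readable on the device and taking USS as the sum of the `Private_Clean` and `Private_Dirty` fields. The function names are hypothetical and the sample text is made up for illustration.

```python
import re

_FIELD = re.compile(r"(\w+):\s+(\d+) kB")


def uss_pss_kib(smaps_text):
    """Sum USS (private clean + private dirty) and PSS over all mappings, in KiB."""
    totals = {"Private_Clean": 0, "Private_Dirty": 0, "Pss": 0}
    for line in smaps_text.splitlines():
        m = _FIELD.match(line)
        if m and m.group(1) in totals:
            totals[m.group(1)] += int(m.group(2))
    return totals["Private_Clean"] + totals["Private_Dirty"], totals["Pss"]


def read_uss_pss_kib(pid="self"):
    """Read the live values for a process; needs permission to read its smaps."""
    with open(f"/proc/{pid}/smaps") as f:
        return uss_pss_kib(f.read())


def shared_with_killed_peer_kib(uss_before, uss_after):
    """Pages that became private once the only other sharer (Nuwa) was killed."""
    return uss_after - uss_before


if __name__ == "__main__":
    # Made-up two-mapping sample, only to show the parsing.
    sample = (
        "Pss: 10 kB\nPrivate_Clean: 4 kB\nPrivate_Dirty: 8 kB\n"
        "Pss: 5 kB\nPrivate_Clean: 0 kB\nPrivate_Dirty: 4 kB\n"
    )
    print(uss_pss_kib(sample))  # (16, 15)
```

Calling `read_uss_pss_kib(pid)` once before and once after killing Nuwa, then passing the two USS values to `shared_with_killed_peer_kib`, gives the figure step 3 describes.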

Cervantes Yu [:cyu] [:cervantes]

Comment 10

•

11 years ago

Some test results w.r.t. comment #9:

Test steps:
- Launch b2g normally.
- Attach to the b2g process, the Nuwa process and the preallocated process.
- Kill all processes except the 3 attached ones.
- Get memory usage with b2g-procrank.
- Kill the Nuwa process.
- Get memory usage again with b2g-procrank.

The USS of the preallocated process increases by 4924 KB.

Test with the Clock app:
- Launch b2g normally and launch the Clock app.
- Attach to the b2g process, the Nuwa process and the Clock app.
- Kill all processes except the 3 attached ones.
- Get memory usage with b2g-procrank.
- Kill the Nuwa process.
- Get memory usage again with b2g-procrank.

The USS of the Clock app increases by 4748 KB.

From these tests, we can estimate that about 176 KB of pages are copied during app launch. It looks like app launch doesn't cause many pages to be copied. We should pay more attention to the pages copied after fork() and before app launch.
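The 176 KB figure is the difference between the two USS increases: memory the preallocated process still shared with Nuwa, minus memory the launched Clock app still shared with it. Written out as a small sketch, assuming 4 KiB pages as in the bug description (the helper name is made up):

```python
PAGE_KIB = 4  # assumed page size, per the bug description


def copied_during_launch_kib(prealloc_shared_kib, app_shared_kib):
    """Memory that stopped being shared with Nuwa between preallocation and launch."""
    return prealloc_shared_kib - app_shared_kib


copied_kib = copied_during_launch_kib(4924, 4748)
copied_pages = copied_kib // PAGE_KIB
print(copied_kib, copied_pages)  # 176 44
```

So the estimate is about 44 pages copied during launch.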

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Comment 11

•

11 years ago

(In reply to Cervantes Yu from comment #6)
> Created attachment 8355515
> COW test program
>
> The test program for COW time measurement.

FYI: I tried modifying the program to access the pages non-sequentially, because I was wondering if locality could explain some of the difference, but it didn't change much: 173 vs. 166 ms for 10000 pages.

Thinker Li [:sinker]

Comment 12

•

11 years ago

We should consider the effects of cache drops. A page fault can mean dropping some old pages in order to load new ones, so pages that belong to the future working set may be dropped. Those drops could introduce more I/O and slow things down.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Updated

•

11 years ago

Depends on: 990790

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Comment 13

•

11 years ago

Attached patch: Linux kernel: add PERF_COUNT_SW_COW_FAULTS.

Patch is based on the Geeksphone Keon kernel (3.0.8+) source; it won't apply to master without changes, but it's simple enough that adapting it should be easy.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Comment 14

•

11 years ago

Attached patch: Bionic: add ARM unwind info to memset/memcpy.

Same as attachment 8413976 from bug 990790. Not strictly necessary, but we'll lose some stacks without it.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Comment 15

•

11 years ago

Attached patch: Gecko, part 1: Deliver SIGPROF for each post-Nuwa COW.

With this patch, if $MOZ_NUWA_COW_PROFILE is set, any thread in a Nuwa child will be sent SIGPROF whenever it causes a page to be allocated for copy-on-write (and SIGPROF is set to be ignored if not already caught, so that the process doesn't immediately die.)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Comment 16

•

11 years ago

Attached patch: Gecko, part 2: Start the profiler and collect the COW signals.

And this patch causes $MOZ_NUWA_COW_PROFILE to start the profiler after the Nuwa fork is finished (on the child side) but skip the usual signal sending. This creates a profile of COW faults for each thread registered with the profiler. This won't catch COWs prior to the profiler's SIGPROF handler being installed, or on non-profiler-registered threads, and its handling of COW faults caused by the signal handler itself could be improved (the signal is blocked while the handler runs, so it will become pending and the handler will be re-invoked after it returns — but it's not a realtime signal, so multiple instances won't be queued in this way). The current number of COW faults for a thread can be determined by read()ing the perf_event fd, which is available to the signal handler in the si_fd member of the siginfo, but this feature isn't used yet. And, of course, it's possible to mmap the fd and receive COW fault information asynchronously.
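The remark about SIGPROF not being a realtime signal can be checked in isolation. The sketch below is a standalone illustration, unrelated to the Gecko patch or the perf_event fd: it blocks a signal, sends it to the calling thread several times, and then drains the pending instances with `sigtimedwait`. The helper name `count_pending_deliveries` is made up, and Linux signal semantics are assumed.

```python
import signal
import threading


def count_pending_deliveries(signum, sends=3):
    """Block `signum`, send it `sends` times to this thread, count what was queued."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signum})
    try:
        for _ in range(sends):
            # Thread-directed, so no other thread can receive it while we have it blocked.
            signal.pthread_kill(threading.get_ident(), signum)
        delivered = 0
        # Timeout 0 polls: returns None as soon as nothing is pending.
        while signal.sigtimedwait({signum}, 0) is not None:
            delivered += 1
        return delivered
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


if __name__ == "__main__":
    print("SIGPROF :", count_pending_deliveries(signal.SIGPROF))
    print("SIGRTMIN:", count_pending_deliveries(signal.SIGRTMIN))
```

On Linux, a standard signal such as SIGPROF should collapse to a single pending instance, while a realtime signal such as SIGRTMIN keeps one queued instance per send. That is why COW faults taken while the handler is running can be undercounted with this design.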

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Comment 17

•

11 years ago

An example: https://people.mozilla.org/~bgirard/cleopatra/#report=508a6291a785e1ba448a4d187e07806a34804d8d I booted the phone, unlocked the homescreen, and swiped through the pages.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Comment 18

•

11 years ago

I've discovered a few problems with the Gecko patch I've posted; I have fixes for them, except one: Binder's thread creation apparently bypasses mozglue (!), so those threads aren't sampled. (They already weren't sampled because they're not registered with the profiler, but I could at least get a count of samples so missed, if any.) It's not clear that this is significant, however.

Cervantes Yu [:cyu] [:cervantes]

Comment 19

•

11 years ago

We can safely skip the binder threads. I tried skipping the start of the binder thread pool in the Nuwa process, and the USS dropped by only 64 KB. I think that's because most content processes don't use binder.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Updated

•

11 years ago

Flags: needinfo?(khuey)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 20

•

11 years ago

How do I read the profile in comment 17? Can you post your updated Gecko patches?

Flags: needinfo?(khuey) → needinfo?(jld)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Comment 21

•

10 years ago

Clearing needinfo, as I won't have time to work on this in the immediate future; checked with khuey on IRC.

Flags: needinfo?(jld)

Nicholas Nethercote [inactive]

Updated

•

10 years ago

Whiteboard: [MemShrink:P1] → [MemShrink:P2]

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Reporter

Comment 22

•

9 years ago

I won't be working on this. I'll let someone else decide whether to leave it open or WONTFIX.

Assignee: jld → nobody

Jan Beich

Comment 23

•

8 years ago

Nuwa is gone after bug 1284674.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → WONTFIX
