Closed Bug 897769 Opened 8 years ago Closed 8 years ago
PERF_SAMPLE_STACK_USER on B2G
Part of the story for using ARM exception handling tables instead of the current frame pointer hacks is being able to get userspace stacks for perf_event profiling. Linux kernel changes in August 2012 allow copying part of the sampled process's stack into the perf buffer as part of each sample, so that a userland agent can perform table-driven stack unwinding instead of embedding that much complexity in the kernel. We'd need to backport those changes to the older kernels we're using for b2g and get an idea of how well they perform. That performance figure is, I think, the main item of missing information here.
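To make the mechanism concrete, here is a minimal sketch of how a profiler might ask the kernel for a user stack copy with each sample. The helper name `init_stack_sampling_attr`, the choice of the software cpu-clock event, and the caller-supplied `regs_mask` are assumptions for illustration, not taken from this bug. The size check reflects my understanding of mainline validation (a multiple of 8 bytes, below 64 KiB) and may differ on a backported kernel.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill in a perf_event_attr that samples at freq_hz
 * and copies up to stack_bytes of the user stack into every sample.
 * Returns 0 on success, -1 if stack_bytes looks invalid. */
int init_stack_sampling_attr(struct perf_event_attr *attr,
                             uint64_t freq_hz,
                             uint64_t regs_mask,
                             uint32_t stack_bytes)
{
    /* Assumed kernel constraint: 8-byte aligned and below 64 KiB. */
    if (stack_bytes % 8 != 0 || stack_bytes >= 65535)
        return -1;

    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_SOFTWARE;          /* assumption: timer-driven */
    attr->config = PERF_COUNT_SW_CPU_CLOCK;   /* sampling via cpu-clock   */
    attr->freq = 1;                           /* sample_freq is in Hz     */
    attr->sample_freq = freq_hz;

    /* A userland unwinder needs the registers at the sample point as
     * well as the raw stack bytes. */
    attr->sample_type = PERF_SAMPLE_REGS_USER | PERF_SAMPLE_STACK_USER;
    attr->sample_regs_user = regs_mask;       /* architecture-specific mask */
    attr->sample_stack_user = stack_bytes;    /* e.g. 512 or 32768 */
    return 0;
}
```

The filled-in structure would then be passed to the `perf_event_open` system call (invoked through `syscall()`, since glibc provides no wrapper), and the samples read back from the mmap'd ring buffer.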
Here's a small benchmark program on my keon:
0m2.98s real 0m2.98s user 0m0.00s system

With perf running globally at 1 kHz, not copying the stack:
0m3.01s real 0m3.00s user 0m0.00s system

Copying 512 bytes of stack per sample:
0m3.01s real 0m3.00s user 0m0.00s system

Copying 32 KiB of stack (same as what the breakpad unwinder in Gecko does):
0m3.30s real 0m3.14s user 0m0.00s system

Copying up to 32 KiB of stack (and allocating that much buffer space per sample), but using only 1184 bytes:
0m3.21s real 0m3.05s user 0m0.00s system

The "real" time includes fwrite()ing the full records to /dev/null, which appears to be slower than actually unwinding them will be. The "user" time difference is the actual cost of the interrupt handler. The empty space in the last case shouldn't cost anything directly (it's not zeroed or otherwise written), but it presumably has cache and/or TLB effects, and increases profiler wakeups.

For something to measure this against, here's an example of the current in-kernel frame pointer unwinding, at 1 kHz, with -mapcs-frame and 102 stack frames:
0m3.05s real 0m3.04s user 0m0.00s system

The stack dump proceeds until the specified size limit is reached or an access fails, so if we're on the main stack then the area with the process's arguments and initial environment will be copied.

My work in progress on bug 810526 has been getting 50-60 µs/sample on somewhat deeply nested stacks, of which a large minority was the Gecko profiler infrastructure. Additionally, there remains room for optimization, and it should be faster when it's not handling the "pop under bitmask" instructions needed for the frame pointers used for meta-profiling.
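As a rough cross-check, the timings above can be turned into a per-sample cost. This is a sketch under one stated assumption that is not in the bug: that about 1000 samples are taken per second of "real" time. The function name `overhead_us_per_sample` is made up for illustration.

```c
/* Estimated profiling cost per sample, in microseconds.
 * Assumption (not from the bug): roughly freq_hz samples are taken per
 * second of wall-clock ("real") time. */
double overhead_us_per_sample(double user_s, double baseline_user_s,
                              double real_s, double freq_hz)
{
    double extra_us = (user_s - baseline_user_s) * 1e6; /* added user time */
    double samples = real_s * freq_hz;                  /* estimated count */
    return extra_us / samples;
}
```

Under that assumption, the full 32 KiB copy works out to roughly 48 µs per sample (0.16 s of extra user time over about 3300 samples), the 512-byte and no-copy cases to roughly 7 µs, and the frame pointer unwinding case to roughly 20 µs. The timings are only reported to 10 ms, so the smaller figures are close to the measurement noise.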
The kernel source: https://github.com/jld/gp-keon-kernel/compare/perf-stackcopy-gp
My small "benchmark" program. Creates a bunch of frames and runs a timing loop.
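For readers without access to the attachment, a program of that shape might look something like the sketch below. It is not the attached file: the function names, the 64-byte padding, and the use of the GCC/Clang `noinline` attribute are all assumptions made for illustration.

```c
#include <stdint.h>

/* Busy loop; the volatile accumulator keeps the compiler from
 * optimizing the work away. */
uint64_t timing_loop(uint64_t iterations)
{
    volatile uint64_t acc = 0;
    for (uint64_t i = 0; i < iterations; i++)
        acc += i;
    return acc;
}

/* Recurse depth levels so the timing loop runs under depth + 1 stack
 * frames. noinline and the "+ 1" after the call keep the compiler from
 * flattening the recursion into a loop or a tail call. */
__attribute__((noinline))
uint64_t nest(unsigned depth, uint64_t iterations)
{
    volatile char pad[64];          /* give each frame some stack to copy */
    pad[0] = (char)depth;
    uint64_t r = (depth == 0) ? timing_loop(iterations)
                              : nest(depth - 1, iterations);
    return r + 1;
}
```

A `main` that calls something like `nest(100, n)` with a large `n`, run under the shell's `time`, would produce timings in the same format as those quoted in this bug.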
I'm going to say that the answer is “yes, fast enough”. 1 kHz should be enough for most uses.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Whiteboard: [c=profiling,p=3] → [c=profiling p=3 s=2013.08.09]