Closed Bug 897769 Opened 8 years ago Closed 8 years ago

Test/benchmark PERF_SAMPLE_STACK_USER on B2G


(Firefox OS Graveyard :: General, defect)

Gonk (Firefox OS)
Not set


(Not tracked)



(Reporter: jld, Assigned: jld)



(Keywords: perf, Whiteboard: [c=profiling p=3 s=2013.08.09])


(1 file)

Part of the story for using ARM exception handling tables instead of the current frame pointer hacks is being able to get userspace stacks for perf_event profiling.  There were Linux kernel changes in August 2012 to allow copying part of the sampled process's stack into the perf buffer as part of the sample so that a userland agent could perform table-driven stack unwinding instead of trying to embed that much complexity in the kernel.

So we'd need to backport it to the older kernels we're using for b2g, and get an idea of how well it performs.  This is, I think, the main item of missing information here.
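For reference, a minimal sketch of what requesting a user stack copy looks like from the userland side, assuming kernel headers new enough to define PERF_SAMPLE_STACK_USER and PERF_SAMPLE_REGS_USER. The helper names, the choice of the software CPU clock as the sampling source, and the caller-supplied register mask (architecture-specific) are illustrative assumptions, not taken from the patches under test.

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: describe a sampling event that fires `freq_hz`
 * times per second and asks the kernel to copy up to `stack_bytes` of
 * the interrupted user stack into every sample. */
void init_stack_sampling_attr(struct perf_event_attr *attr,
                              uint64_t freq_hz, uint32_t stack_bytes,
                              uint64_t regs_mask)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_SOFTWARE;        /* assumption: CPU-clock sampling */
    attr->config = PERF_COUNT_SW_CPU_CLOCK;
    attr->freq = 1;                         /* sample_freq is in Hz, not a period */
    attr->sample_freq = freq_hz;
    attr->sample_type = PERF_SAMPLE_TID | PERF_SAMPLE_TIME |
                        PERF_SAMPLE_REGS_USER | PERF_SAMPLE_STACK_USER;
    attr->sample_regs_user = regs_mask;     /* arch-specific; the unwinder needs at least SP and PC */
    attr->sample_stack_user = stack_bytes;  /* upper bound on bytes copied per sample */
    attr->exclude_kernel = 1;
    attr->disabled = 1;                     /* enable later via ioctl */
}

/* Thin wrapper around the raw syscall; glibc provides no wrapper.
 * Returns a file descriptor, or -1 with errno set. */
int open_stack_sampling_event(struct perf_event_attr *attr, pid_t pid, int cpu)
{
    return (int)syscall(SYS_perf_event_open, attr, pid, cpu, -1, 0UL);
}
```

Something like init_stack_sampling_attr(&attr, 1000, 32 * 1024, mask) followed by open_stack_sampling_event(&attr, -1, cpu) for each CPU would correspond to the "global, 1 kHz, 32 KiB" case; on a kernel without the backport the open is expected to fail.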
Here's a small benchmark program running on my Keon, without perf:
    0m2.98s real     0m2.98s user     0m0.00s system

With perf running globally at 1 kHz, not copying the stack:
    0m3.01s real     0m3.00s user     0m0.00s system

Copying 512 bytes of stack per sample:
    0m3.01s real     0m3.00s user     0m0.00s system

Copying 32 KiB of stack (same as what the breakpad unwinder in Gecko does):
    0m3.30s real     0m3.14s user     0m0.00s system

Copying up to 32 KiB of stack (and allocating that much buffer space per sample), but using only 1184 bytes[1]:
    0m3.21s real     0m3.05s user     0m0.00s system

The "real" time includes fwrite()ing the full records to /dev/null, which appears to be slower than actually unwinding them will be[2].  The "user" time difference is the actual cost of the interrupt handler.  The empty space in the last case shouldn't cost anything directly (it's not zeroed or otherwise written), but it presumably has cache and/or TLB effects, and increases profiler wakeups.

For something to measure this against, here's an example of the current in-kernel frame pointer unwinding: 1 kHz, -mapcs-frame, 102 stack frames:

    0m3.05s real     0m3.04s user     0m0.00s system
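To turn those timings into a rough per-sample cost, here is a small sketch of the arithmetic. It assumes one sample per millisecond of wall-clock time (1 kHz over the "real" time) and that all of the extra "user" time is interrupt-handler cost; the function name is made up for illustration, and the timings only have 10 ms resolution, so the results are coarse.

```c
#include <stdio.h>

/* Extra user time attributed to profiling, spread over the estimated
 * number of samples taken, in microseconds per sample. */
double overhead_us_per_sample(double baseline_user_s, double profiled_user_s,
                              double profiled_real_s, double rate_hz)
{
    double samples = profiled_real_s * rate_hz; /* assumption: no lost samples */
    return (profiled_user_s - baseline_user_s) * 1e6 / samples;
}

/* Print the estimate for each configuration measured above. */
void print_overheads(void)
{
    static const struct {
        const char *label;
        double real_s, user_s;
    } runs[] = {
        { "no stack copy",           3.01, 3.00 },
        { "512 bytes",               3.01, 3.00 },
        { "32 KiB",                  3.30, 3.14 },
        { "32 KiB limit, 1184 used", 3.21, 3.05 },
        { "frame pointers",          3.05, 3.04 },
    };
    const double baseline_user_s = 2.98;

    for (unsigned i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
        printf("%-26s %6.1f us/sample\n", runs[i].label,
               overhead_us_per_sample(baseline_user_s, runs[i].user_s,
                                      runs[i].real_s, 1000.0));
}
```

Under those assumptions the full 32 KiB copy comes out at a bit under 50 µs per sample, against roughly 20 µs for the frame pointer walk.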

[1] The stack dump proceeds until the specified size limit is reached or an access fails, so if we're on the main stack then the area with the process's arguments and initial environment will be copied.
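A sketch of how a consumer might read the stack portion of a sample, assuming the layout documented in perf_event_open(2): a u64 size, then size bytes of data, then a u64 dyn_size giving the bytes actually copied (the trailing field is present only when size is nonzero). The struct and function names are hypothetical, and locating this field within the full record (after the other sample_type fields) is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct stack_dump {
    const unsigned char *data; /* points into the record; nothing is copied */
    uint64_t size;             /* bytes reserved in the record */
    uint64_t dyn_size;         /* bytes the kernel actually filled in */
};

/* Parse the PERF_SAMPLE_STACK_USER portion starting at `p`, with `len`
 * bytes remaining in the record. Returns the number of bytes consumed,
 * or 0 if the input is truncated or inconsistent. */
size_t parse_stack_user(const unsigned char *p, size_t len,
                        struct stack_dump *out)
{
    uint64_t size, dyn_size = 0;
    size_t used = sizeof(size);

    if (len < used)
        return 0;
    memcpy(&size, p, sizeof(size));

    if (size != 0) {
        if (size > len - used || len - used - size < sizeof(dyn_size))
            return 0;
        memcpy(&dyn_size, p + used + size, sizeof(dyn_size));
        if (dyn_size > size)
            return 0;
        used += (size_t)size + sizeof(dyn_size);
    }

    out->data = size ? p + sizeof(size) : NULL;
    out->size = size;
    out->dyn_size = dyn_size;
    return used;
}
```

In the last measurement above, size would be 32768 and dyn_size 1184; only the first dyn_size bytes of data are meaningful to the unwinder.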

[2] My work in progress on bug 810526 has been getting 50-60 µs/sample on somewhat deeply nested stacks, of which a large minority was the Gecko profiler infrastructure.  Additionally, there remains room for optimization, and it should be faster when it's not handling the "pop under bitmask" instructions needed for the frame pointers used for meta-profiling.
Attached file loop.c
My small "benchmark" program.  Creates a bunch of frames and runs a timing loop.
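For readers without access to the attachment, a hypothetical stand-in with the same general shape (this is not the attached loop.c): recurse to a chosen depth so the stack holds that many frames, then spin in a loop at the bottom so that is where samples land. The function name and the volatile sink are illustrative choices.

```c
#include <stdint.h>

/* Volatile so the compiler cannot discard the loop or the per-frame store. */
static volatile uint64_t sink;

/* Build `depth` nested frames, then run `iterations` loop steps in the
 * innermost one. Returns the loop's sum plus the depth. */
__attribute__((noinline))
uint64_t nest(unsigned depth, uint64_t iterations)
{
    if (depth == 0) {
        uint64_t acc = 0;
        for (uint64_t i = 0; i < iterations; i++) {
            acc += i;
            sink = acc;
        }
        return acc;
    }
    uint64_t r = nest(depth - 1, iterations);
    sink = r; /* work after the call, intended to discourage tail-call elimination */
    return r + 1;
}
```

Calling nest(101, n) from main with n tuned for about three seconds of runtime would give a stack on the order of the 102 frames quoted above; building with -fno-optimize-sibling-calls is a safer way to guarantee that the frames really exist.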
I'm going to say that the answer is “yes, fast enough”.  1 kHz should be enough for most uses.
Closed: 8 years ago
Resolution: --- → FIXED
Keywords: perf
Whiteboard: [c=profiling,p=3] → [c=profiling p=3 s=2013.08.09]