Closed Bug 831611 Opened 11 years ago Closed 7 years ago

[META] Get perf working on b2g

Categories

(Firefox OS Graveyard :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: justin.lebar+bug, Unassigned)

References

Details

(Whiteboard: [perf-reviewed])

Attachments

(3 files, 3 obsolete files)

The Gecko profiler is nice, but does not work well on B2G, principally because it can't walk native (i.e. C++) stacks: AIUI the stack-walking code we have doesn't work from within signal handlers.

Anyway, I think it should be easy to get perf working on b2g.  I already got a somewhat-sketchy binary [1] and ran it.  I now need to write fix_perf_stack.py to symbolicate the result.

[1] http://code.google.com/p/android-group-korea/downloads/detail?name=perf-binary-kwangwoo-for-android.zip&can=2&q=
It's possible that I've messed up the address translation here, but I checked a few of them by hand, and they seem right.  This report looks pretty useless.
Oh, and this was on a DMD build, which has -funwind-tables.  But perhaps perf isn't using that.
Heh, we apparently did this once. https://github.com/andreasgal/B2G/issues/167

When I run perf report, I get

> Failed to open /system/bin/linker, continuing without symbols

but I presume that's not what's keeping me from getting good addresses out of perf which I can later translate.
Okay, so taking objdir-gecko/toolkit/library/libxul.so, running strip -g (strips out debug info), putting that on the phone, and then running perf gives me good top-level frames.  The stacks are still useless, though, and it still shows us spending 75% startup time in a mystery function.

The mystery function's PC is 0xffff0fd4, which looks an awful lot like a stack address or something.  I'm not sure what to make of that; I wouldn't expect even the JIT to run code off the stack.

Another reason not to believe it's JIT code is that it's the one thing the Gecko profiler is supposed to know how to show us, and the gandalf's SPS profiles in bug 831135 comment 4 do not show time spent in the JIT.
(In reply to Justin Lebar [:jlebar] from comment #4)
> Okay, so taking objdir-gecko/toolkit/library/libxul.so, running strip -g
> (strips out debug info), putting that on the phone, and then running perf
> gives me good top-level frames.  The stacks are still useless, though, and
> it still shows us spending 75% startup time in a mystery function.
> 
> The mystery function's PC is 0xffff0fd4, which looks an awful lot like a
> stack address or something.  I'm not sure what to make of that; I wouldn't
> expect even the JIT to run code off the stack.

I recognize that address. Take a look here:
https://github.com/mozilla-b2g/B2G/blob/master/scripts/profile-symbolicate.py#L5

0xffff0fd4 corresponds to __kernel_cmpxchg which is used to implement atomic operations in user space.
Actually, looking at the bionic source, it seems that we may be using something designed for ARMv5.

If I do:

touch empty.c
prebuilt/linux-x86/toolchain/arm-linux-androideabi-4.4.x/bin/arm-linux-androideabi-gcc -E -dD empty.c | grep ARCH

it reports:

#define __ARM_ARCH_5TE__ 1

My unagi reports being an ARMv7, so theoretically, we shoud be able to rebuilt the toolchain or add -mcpu=cortex-a8 then it reports:

#define __ARM_ARCH_7A__ 1

which would enable the presumably more efficient variant of the atomic operations.

See: bionic/libc/arch-arm/bionic/atomics_arm.S
Note perf doesn't like elfhack, so you have to disable elfhack if you have it enabled.
You can also get debug symbols for system libraries by invoking a build.sh command i forget.
So I disassembled ./out/target/product/otoro/obj/STATIC_LIBRARIES/libc_common_intermediates/arch-arm/bionic/atomics_arm.o and it seems to be using LDREX and STREX, so I wonder where the references to 0xffff0fd4 are coming from?
> so I wonder where the references to 0xffff0fd4 are coming from?

It could be being called from the kernel, for all we know.  :-/
Attached file Perf report from main process (obsolete) —
elfhack disabled, building with -fno-omit-frame-pointer and -funwind-tables.  I unfortunately still don't have useful stacks from Gecko.

However, there is an interesting point here, if we believe it: It says that the main process spends ~50% of its time in finish_task_switch.clone.3 in the kernel.

We don't have a Gecko caller for most of the context switches, but the one called out here (pt_PostNotifies, ptsynch.c:111) seems reasonable.

Note that perf is a CPU time profiler, rather than a wall clock time profiler.  (SPS profiles wall time.)  So it's not clear how much of the main process's time is spent asleep, versus how much time is spent context-switching.
>      7.37%  GL updater  libdmd.so                 [.] __udivsi3

This is also interesting, because the only place DMD does division (I checked the object file) is in replace_aligned_alloc and replace_posix_memalign.  I wouldn't expect those functions to be called often enough to show up here.
I found the code in the kernel which walks the userspace stacks for perf.

It is not very intelligent.  :)

http://lxr.linux.no/#linux+v3.7.3/arch/arm/kernel/perf_event.c#L528

It uses the frame pointer (regs->ARM_fp) and then expects that to point to the previous frame's fp, and so on.

What's interesting is that even when I compile a simple program with -fno-omit-frame-pointer, perf still can't walk the stack.  Perhaps I'm not compiling it right, or perhaps the stack-walking code in the kernel on our devices is different from the code at kernel tip.
Okay, we discovered that perf seems to be able to walk stacks if you do

  -marm -mapcs-frame -fno-omit-frame-pointer

Crucially, perf does /not/ seem happy with

  -mthumb -mtpcs-frame -mtpcs-leaf-frame -fno-omit-frame-pointer

Which means that if we want perf stacks, we apparently have to compile as arm.  Let's see if this works at all...
Depends on: 832379
Depends on: 832752
(In reply to Justin Lebar [:jlebar] from comment #13)
> Okay, we discovered that perf seems to be able to walk stacks if you do
> 
>   -marm -mapcs-frame -fno-omit-frame-pointer
> 
> Crucially, perf does /not/ seem happy with
> 
>   -mthumb -mtpcs-frame -mtpcs-leaf-frame -fno-omit-frame-pointer

That's the same issue I've encountered when trying to implement stack walking using _Unwind_Backtrace(). If you try generating a back-trace from GDB using the same flags you'll encounter the same problem. The issue is that when the kernel enters the signal handler it leaves what looks like a non-Thumb stack frame on the stack even when the signal handler has been invoked in a Thumb executable. Most unwinders get confused by this (understandably, the ABI is not the same) and will work only if your code is also compiled in non-Thumb mode.

> Which means that if we want perf stacks, we apparently have to compile as
> arm.  Let's see if this works at all...

The issue is that you'll have some significant code-size differences between ARM- and Thumb-mode so the results will be a bit skewed. I was hoping to set aside enough free time to implement a minimal NS_StackWalk() that would deal with this if the breakpad-based profiler doesn't land soon but I'm not really sure how long it might take me.
> The issue is that when the kernel enters the signal handler it leaves what looks like a non-Thumb 
> stack frame on the stack even when the signal handler has been invoked in a Thumb executable. Most 
> unwinders get confused by this (understandably, the ABI is not the same) and will work only if your 
> code is also compiled in non-Thumb mode.

I don't think this is the issue with the kernel unwinding code I linked to in comment 12.

First of all there's no user-space signal handler, as far as I can tell.  It's a kernel interrupt.  So the kernel does not have to unwind through a signal handler.

But second of all, the kernel does not appear to have any check for whether we're in ARM or Thumb mode.  It's assuming that the FP is in the same place in both cases.  And -mthumb -mtpcs-frame -mtpcs-leaf-frame -fno-omit-frame-pointer does not put the FP there, so the kernel can't unwind.
fwiw it would be /great/ if we could figure out how to get perf to be happy with thumb code.  As you say, -marm will have different performance characteristics than -mthumb.  But also, -marm doesn't work with gcc 4.4, so we have to compile with a different compiler!  And -mapcs-frame will of course obviously make things slower.  At some point it becomes a different build...  :)

But I'm still convinced that we want to get perf working one way or another, if we can; we have too many issues with "I have no idea where this time is going" with the built-in profiler, and it's not entirely because we can't walk the native stacks.  See e.g. bug 831135.
(In reply to Justin Lebar [:jlebar] from comment #15)
> First of all there's no user-space signal handler, as far as I can tell. 
> It's a kernel interrupt.  So the kernel does not have to unwind through a
> signal handler.

AFAIK the kernel leaves a dummy frame which will cause the userspace signal handler to invoke sigreturn() when returning. The only thing that changes is the status of the CPSR (current status program register), see setup_return():

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob_plain;f=arch/arm/kernel/signal.c;hb=HEAD

The code in _Unwind_Backtrace() or GDB however excepts all the stack frames to conform to the Thumb ABI and this particular frame doesn't thus breaking the whole process. I'm not sure if the kernel has the same issue but from your description I'd guess it does.

> But second of all, the kernel does not appear to have any check for whether
> we're in ARM or Thumb mode.  It's assuming that the FP is in the same place
> in both cases.  And -mthumb -mtpcs-frame -mtpcs-leaf-frame
> -fno-omit-frame-pointer does not put the FP there, so the kernel can't
> unwind.

Yeah, this sounds a lot like a problem caused by an ABI mismatch, i.e. the last Thumb frame before the handler points back as if the signal handler frame would also conform to the Thumb ABI. Since it doesn't the whole procedure falls apart. That's what I meant when I said that the kernel leaves a non-Thumb frame; I now realize that I could have worded my post far better than I did :-)
To elaborate a bit further on this most unwinders I've looked at try to lookup the return address of the last frame before the signal handler in the EH to find the appropriate unwinding information. The return address however points to sigreturn() which is a syscall (thus in the 0xffff0000-0xffffffff area) and thus they can't find a matching EH entry; as a fallback they try using the FP but they don't find a valid one either and so they give up.

It might be that we just need to special-case the unwinding code by detecting that we're dealing with a signal handler stack frame and skipping over it to the last frame before the signal handler was invoked. After all the signal handler frame should always have the same size and layout.
Let me elaborate more on why I don't think signal handler frames are the
problem here.  Maybe I'm wrong and you can give us a new way forward towards
getting perf to work.

Suppose we compile the following file:

>  int bar(int);
>
>  int foo() {
>    int i;
>    for (i = 0; i < 100; i++) {
>      bar(i);
>    }
>    return i;
>  }

 With -marm -O2, I get
> 00000000 <foo>:
>    0:	e92d4010 	push	{r4, lr}
>    4:	e3a04000 	mov	r4, #0
>    8:	e1a00004 	mov	r0, r4
>    c:	e2844001 	add	r4, r4, #1
>   10:	ebfffffe 	bl	0 <bar>
>   14:	e3540064 	cmp	r4, #100	; 0x64
>   18:	1afffffa 	bne	8 <foo+0x8>
>   1c:	e1a00004 	mov	r0, r4
>   20:	e8bd8010 	pop	{r4, pc}

All that's relevant for our purposes are the push and pop statements, so we can
ignore the rest.

Is the frame here unwindable by the kernel code in comment 12?  Well, the
kernel code is looking for a frame which contains three pointers, {fp, sp, lr}.
This frame can't possibly work, because it pushes only two registers.

Suppose we compile instead with -marm -O2 -mapcs-frame.  Then our function
prelude is

> 00000000 <foo>:
>    0:	e1a0c00d 	mov	ip, sp
>    4:	e92dd818 	push	{r3, r4, fp, ip, lr, pc}

Now we have {fp, sp, lr}  in the stack frame, and it turns out that the kernel
can unwind this frame.

Compare this to the prelude when we compile with -mthumb -O2:

> 00000000 <foo>:
>    0:	b510      	push	{r4, lr}

That's  not going to work; it's the same as -marm -O2.  What about -mthumb -O2
-mtpcs-frame -mtpcs-leaf-frame -fno-omit-frame-pointer?

> 00000000 <foo>:
>    0:	b084      	sub	sp, #16
>    2:	b598      	push	{r3, r4, r7, lr}

In order for this to work, gcc must be storing the fp in r4, because the kernel
is expecting {fp, sp, lr}.  (The kernel doesn't do anything with the sp value,
so it would be ok if r7 was used as a general-purpose register.)

Let's see if we can figure out which register is used as the fp in this
disassembly (again, generated with -mthumb -O2 -mtpcs-frame -mtpcs-leaf-frame
-fno-omit-frame-pointer):

> 00000000 <foo>:
>    0:	b084      	sub	sp, #16
>    2:	b598      	push	{r3, r4, r7, lr}
>    4:	ab08      	add	r3, sp, #32
>    6:	9305      	str	r3, [sp, #20]
>    8:	467b      	mov	r3, pc
>    a:	9307      	str	r3, [sp, #28]
>    c:	465b      	mov	r3, fp
>    e:	9304      	str	r3, [sp, #16]
>   10:	4673      	mov	r3, lr
>   12:	9306      	str	r3, [sp, #24]
>   14:	ab07      	add	r3, sp, #28
>   16:	469b      	mov	fp, r3
>   18:	af00      	add	r7, sp, #0
>   1a:	2400      	movs	r4, #0
>   1c:	1c20      	adds	r0, r4, #0
>   1e:	3401      	adds	r4, #1
>   20:	f7ff fffe 	bl	0 <bar>
>   24:	2c64      	cmp	r4, #100	; 0x64
>   26:	d1f9      	bne.n	1c <foo+0x1c>
>   28:	46bd      	mov	sp, r7
>   2a:	2064      	movs	r0, #100	; 0x64
>   2c:	bc98      	pop	{r3, r4, r7}
>   2e:	bc0e      	pop	{r1, r2, r3}
>   30:	4693      	mov	fp, r2
>   32:	469d      	mov	sp, r3
>   34:	4708      	bx	r1

r4 is used as the loop control variable (cmp r4, #100), so it's not being used
as a frame pointer.

Again, as far as I can tell from reading the source code, the kernel unwinder
does not look for EH entries, or any elf magic.  It looks only at stack frames.

The kernel does not have both ARM and Thumb unwind code, so, while I don't
think it's wrong to call the problem an ABI mismatch, the issue isn't that the
kernel is unwinding through an ARM signal handler into Thumb code and not
noticing the transition to Thumb.  The issue, as far as I can tell, is that the
kernel simply has no code whatsoever to unwind this sort of frame.

If we could convince gcc to push {fp, X, lr} in thumb mode, I think there
would be a (good) chance the kernel could unwind these thumb frames.
(In reply to Justin Lebar [:jlebar] from comment #19)
> Let me elaborate more on why I don't think signal handler frames are the
> problem here.  Maybe I'm wrong and you can give us a new way forward towards
> getting perf to work.

Just a small premise, if my ARM-fu is not too rusty the frame-pointer should be r7 in Thumb mode and r11 in ARM mode.

> In order for this to work, gcc must be storing the fp in r4, because the
> kernel
> is expecting {fp, sp, lr}.  (The kernel doesn't do anything with the sp
> value,
> so it would be ok if r7 was used as a general-purpose register.)

I can't rememeber right now what's the sp in Thumb mode but r7 is definitely the frame-pointer so this is {fp, lr}. I can't remember if r4 has a special role but it might.

> r4 is used as the loop control variable (cmp r4, #100), so it's not being
> used
> as a frame pointer.

I have to dig out the full ABI description but FP is definitely r7 in Thumb mode. The only thing I'm not sure of is if r4 has a special role or not.

> Again, as far as I can tell from reading the source code, the kernel unwinder
> does not look for EH entries, or any elf magic.  It looks only at stack
> frames.

I see, so for this to work we need a valid fp-chain in the stack; things like _Unwind_Bactrace() or GDB also look at the EH entries so I assumed the kernel might want to do the same.

> The issue, as far as I can tell, is that
> the kernel simply has no code whatsoever to unwind this sort of frame.

It seems so, but it might also be possible that it's not looking for the FP where it's supposed to be (i.e. in r7 instead of r11).

> If we could convince gcc to push {fp, X, lr} in thumb mode, I think there
> would be a (good) chance the kernel could unwind these thumb frames.

Yes, pretty much every unwinder should work then.
> It seems so, but it might also be possible that it's not looking for the FP where it's 
> supposed to be (i.e. in r7 instead of r11).

I encourage you to read the kernel code I've linked.  It's quite straightforward, and the relevant bit is only about a hundred lines.  It would certainly be possible for the kernel to look for {fp, lr} on the stack when we're in thumb mode, but as far as I can tell, it is not doing so.
cjones, how hard exactly would it be to rebuild our kernel?

If B2G works with -mthumb -mtpcs-frame -mtpcs-leaf-frame -fno-omit-frame-pointer (I need to check), then modifying the kernel so that it can walk these frames might be simpler than upgrading to gcc 4.6, fixing bug 832752, and fixing whatever roadblock I hit after that.
Flags: needinfo?(jones.chris.g)
(In reply to Justin Lebar [:jlebar] from comment #21)
> I encourage you to read the kernel code I've linked.  It's quite
> straightforward, and the relevant bit is only about a hundred lines. 

OK, it looks very straightforward indeed. I'm not sure why it doesn't walk correctly the -mthumb -O2 -mtpcs-frame -mtpcs-leaf-frame -fno-omit-frame-pointer case, the fp should be on the stack so it must be looking for it at the wrong offset.

> It would certainly be possible for the kernel to look for {fp, lr} on the stack
> when we're in thumb mode, but as far as I can tell, it is not doing so.

Yes, BTW I agree that it would be better to put a fix in the kernel if possible as I don't think we'll be able to work around this with the compiler. Did you have the chance to check what more recent GCCs outputs for this case? If not shall I do it?
> Did you have the chance to check what more recent GCCs outputs for this case?

I didn't realize it at the time, but the output from comment 19 is with gcc 4.6.  But gcc 4.4 has the same preludes, though.

> I'm not sure why it doesn't walk correctly the -mthumb -O2 -mtpcs-frame -mtpcs-leaf-frame -fno-omit-
> frame-pointer case, the fp should be on the stack so it must be looking for it at the wrong offset.

I think you're probably right that it's pushing {r7, lr} as {fp, lr}.  So the fp is on the stack.  But then the kernel is unwinding the stack using

> struct frame_tail {
>   struct frame_tail __user *fp;
>   unsigned long sp;
>   unsigned long lr;
> } __attribute__((packed));

which expects there to be a word between fp and lr.
We're getting there...

This is a perf report of starting up the FM radio app.  I attached perf to the idle preallocated process, then started the FM radio app.  Once it finished loading, I detached perf.

I stuck a strip -g'ed libxul.so on the phone, and if you scroll down a bit, you can see that we get pretty reasonable looking stacks from it.

But the topmost stack is still totally opaque, and my addr2line tool isn't helping.  I suspect we may be inside JS at the time and be unable to unwind out of it.
Attachment #703150 - Attachment is obsolete: true
Attachment #703293 - Attachment is obsolete: true
I'd forgotten to disable elfhack; doing so made this a lot better!

This still isn't very useful, though.  It seems like perf is unable to aggregate the top entry (the 70-something percenter), so it's just spitting out a bunch of random stack frames.

The top entry is the one that's not addr2line'ed by perf, so my guess is that the fact that it can't aggregate these frames is related to the fact that it can't addr2line them.
Attachment #705432 - Attachment is obsolete: true
(In reply to Justin Lebar [:jlebar] from comment #22)
> cjones, how hard exactly would it be to rebuild our kernel?
> 
> If B2G works with -mthumb -mtpcs-frame -mtpcs-leaf-frame
> -fno-omit-frame-pointer (I need to check), then modifying the kernel so that
> it can walk these frames might be simpler than upgrading to gcc 4.6, fixing
> bug 832752, and fixing whatever roadblock I hit after that.

We can get new kernel builds.  Please list exactly the config change you want and mwu or someone else can send out the request.
Flags: needinfo?(jones.chris.g)
> Please list exactly the config change you want and mwu or someone else can send out the 
> request.

I'd need to actually patch the kernel source.  That would be difficult if I couldn't compile it myself.

I could try writing a patch against a different version of the kernel, perhaps even targeting a different device (e.g. an Android phone).
A patch against the kernel tree would be even easier.  The baseline tree used by downstreams is in your b2g checkout as kernel/.  There should be some msm* config you can use to sanity check the patch.
cjones asked on IRC to see this.  It's pretty hard for me to read, but perhaps there's some interesting data here.
Attachment #705445 - Attachment is patch: false
It's a little hard to tell but it looks like we might be spending fairly significant time loading JS modules.  At any rate, about 20% of the time at the top of the profile is going to JS execution stuff.

Another 20% in the middle is painting.

Below that and ~12% is more JS execution, this time definitely of content script.  There's some HTML and CSS stuff going on in there too, maybe dynamically created content.

Then we have 6% or so running onload.

Below that is about 10% in screenshotting :(.

5% is showing up in layers code, but not in the tree where it should be.  Not sure what's up there.

Looks like the data is relatively useful.  Just need to load it in a UI that lets it talk to us :(.
> Looks like the data is relatively useful.

Just to be clear, that's assuming that the data is correct, which I myself am not ready to give a lot of confidence to.  :)
fwiw, cjones's latest profiling exploits have convinced me that there's a lot of value in being able to use SPS to profile.  So I'm going to see if anything I learned from this perf work can apply towards getting stacks out of SPS.
(In reply to Justin Lebar [:jlebar] from comment #33)
> fwiw, cjones's latest profiling exploits have convinced me that there's a
> lot of value in being able to use SPS to profile.  So I'm going to see if
> anything I learned from this perf work can apply towards getting stacks out
> of SPS.

The existing bug for that is 810526 which is assigned to me. I've put it on hold as work on bug 779291 was going on though I don't think that we'll have anything soon. IMHO there's two ways forward, either we help out finishing the last dependencies for breakpad-based SPS unwinding or we put together something real quick for B2G only.

On this topic I had a discussion recently with :mwu and we came up with the idea of trying to come up with our own minimal stack-walking code by re-using libgcc's existing facilities to read EH tables. We didn't check how feasible it is but it could be a quick way of having a decent stack walker (we'd need to add workarounds for signal handlers but that shouldn't be too complicated). If we could modify _Unwind_Backtrace() directly this would all be much easier but unfortunately that's not an option.
Note, the gecko profiler tells us what's going on in main threads, but if non-main threads are contributing to overhead, it will only be able to show that certain tasks are taking longer.  So there's still value in having a working perf, even if it can't show us samples in .js code, e.g.
> IMHO there's two ways forward, either we help out finishing the last dependencies for 
> breakpad-based SPS unwinding or we put together something real quick for B2G only.

I dont't think this has to be particularly complicated.  Instead, I think we can just do what the kernel does: Assume we're compiled with a standard frame (e.g. -mapcs-frame or -mtpcs-frame) and walk the stack manually.
(In reply to Justin Lebar [:jlebar] from comment #36)
> I dont't think this has to be particularly complicated.  Instead, I think we
> can just do what the kernel does: Assume we're compiled with a standard
> frame (e.g. -mapcs-frame or -mtpcs-frame) and walk the stack manually.

In that case it should be a real quick fix to NS_StackWalk() with only a small addition for the {fp, lr} case you encountered.
(In reply to Justin Lebar [:jlebar] from comment #16)
> -marm doesn't work with gcc 4.4

What I'm seeing is that a libxul.so built with |--with-thumb=no| using gcc 4.4 fails in B2G sometime after the startup animation runs but before the homescreen appears.  Is this the same "doesn't work"?
> a libxul.so built with |--with-thumb=no| using gcc 4.4 fails in B2G sometime after the 
> startup animation runs but before the homescreen appears. 

Yes; IIRC I was able to get to the lock screen but the device did something wrong when I tried to unlock (I forget if it crashed or simply didn't unlock).  In fact something is apparently miscompiled in the JS engine; we don't pass the JS tests with a JS shell built with gcc 4.4 -marm.  See bug 832379.
I'll take this; see also PR 810526.
Assignee: nobody → jld
Bug 810526, rather.  My finger macros need some adjusting.
It turns out that the instructions used to implement -mapcs-frame aren't quite legal for Thumb2 — LDMIA/LDMFD can't include SP (nor both LR and PC), STMDB/STMFD can't include PC (or SP), and general arithmetic can't write to SP.  Working around all of this is doable but less trivial than I thought, and then there's the question of what works vs. what upstream GCC might be willing to take.

But also: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0041c/Cegbidie.html declares that "Only APCS variants that do not require a frame pointer register are supported. APCS variants that require a frame pointer register are obsolete."
One useful thing about the code linked in comment #12 is that it doesn't even represent the saved PC field, and doesn't use the saved SP field (which is always going to be its own address + 12 in any case).  If I don't care what's in those then this is easier.

The hardest part might be getting GCC to leave r11 alone and not dead-code-eliminate writes to it (the prologue is mostly RTL, but the epilogue is printf'ed).  Making the hard frame pointer register (and GCC really wants there to be at most 1) be r11 should probably be avoided, I think, because that turns all stack accesses into 32-bit instructions.
I think I have something that works — here's a "screenshot" from `perf report`, with samples recorded while I was zooming/scrolling on a simple web page:  http://pastebin.mozilla.org/2160028

I haven't actually tried running perf on an unmodified B2G yet, but I infer from the early comments that I wouldn't be seeing reasonable-looking stacks like that if I did.

This involves changes to the Linaro Android GCC repo, the Linaro Android Toolchain Builder (for compatibility issues with GCC 4.4.3), and manually building the previous, and it's glued in via some lines in my .userconfig; so this needs some more work to be easily usable by people other than me.
> -   9.74%  HTML5 Parser  [kernel.kallsyms]  [k] finish_task_switch.clone.3

OT, but it's really weird to me that we're spending 10% of our time context switching.
…and trying to use `perf record -a`, to see if context switching was actually taking that much of our time, turned up a bug in perf with using --kallsyms on something that isn't a KVM guest and getting samples from kernel modules.  (Being able to run `perf report` on a host other than the phone itself seems useful.)  I have a workaround for it, but it probably breaks other things.

Also, trying to cross-build perf is... not trivial.  The binaries linked in comment #0 work, but it would be good to be able to build them.
Okay.  The modified GCC can be built thusly:

 mkdir linaro-toolchain
 cd linaro-toolchain
 repo init -b b2g -u git://github.com/jld/linaro-android-toolchain-manifest.git
 repo sync
 cd build
 LINARO_BUILD_EXTRA_CONFIGURE_FLAGS=--disable-graphite LINARO_BUILD_EXTRA_MAKE_FLAGS=-j10 ./linaro-build.sh --with-gcc=gcc-4.4.3

Adjust directory name and build parallelism as desired.  Note that the repo sync will pull down about a gig of Android bits you probably already have, and another half gig or so or GCC and other things.  Note also that that will install to /tmp/android-toolchain-eabi by default, and if that's not desired then linaro-build.sh takes a --prefix flag and adjust the next part of this accordingly.

Then, once https://github.com/mozilla-b2g/gonk-misc/pull/73 is merged, append this to a B2G .userconfig:

 export TARGET_TOOLS_PREFIX=/tmp/android-toolchain-eabi/bin/arm-linux-androideabi-
 export CFLAGS="-mapcs-frame -mthumb2-fake-apcs-frame"
 export CXXFLAGS="-mapcs-frame -mthumb2-fake-apcs-frame"

Note that --enable-profiling is not needed here, and -fomit-frame-pointer doesn't omit this frame pointer.  Finally, rm -rf objdir-gecko (and maybe also rm -rf out) and rebuild.

This costs 8 bytes of .text per function (and some number of cycles that I haven't tried to determine) and 8 bytes of stack per activation.  And empirically, if a trace is inside code we don't build (e.g., GL drivers) perf seems to somehow find the frame chain anyway, after a couple of nonsense addresses.

Actually running perf is still overly complicated; apparently there was automation for this at some point but its current status is unclear.

And, of course, the toolchain build should be integrated into the B2G build and (I assume) hosted in something other than my github account.
Cross-compiling perf really does not want to happen for us — it needs to include some headers from the kernel tree it's part of, which mix with the headers from our sysroot in a way that I haven't been able to make not broken.  There's documentation which suggests that cross-compiling for Android would work with a newer NDK, but given how much obvious bit-rot there is in the perf codebase (documented make variables that break the build, documented flags that cause segfaults) I'm not inclined to take it at its word.

The binaries in update #0 are statically linked against glibc, possibly from a non-cross build on a more normal ARM Linux host.  That's one option, but it doesn't make it easy for us to build the target perf as part of our build.  I don't know how much we care about that.  I briefly considered cross-building glibc or eglibc, but that's an elaborate process that involves compiling GCC three (3) separate times.

Another option I've been looking into is writing a minimal replacement for `perf record`, without the kernel header dependencies, since that's most of what we'd want to run on the target, I think.  Except the file format isn't documented.  Or, rather, it appears to be “documented” by C code with approximately no comments.  There's a PDF floating around[1] that explains some of it, written as some sort of student project by someone else who apparently found perf itself opaque enough to want to reimplement the record file interface (the reader, in that case), but it's still vague in places.  Fortunately, most of the record file is copied directly from shared-memory buffers populated by the target kernel, but there are a few levels of headers to deal with.

If/when we start building our own kernels — and I don't know what impediments remain there — this may become simpler.  More generally, the “right” approach here may depend on what our longer-term plans for this system are.

[1] https://openlab-mu-internal.web.cern.ch/openlab-mu-internal/03_Documents/3_Technical_Documents/Technical_Reports/2011/Urs_Fassler_report.pdf
I think I've figured out how to turn Debian's libc packages into enough of a sysroot to build perf against.  That should be reasonably easy to automate.  (I have also succeeded in doing a self-hosted build in a chroot on an unagi.  It works, but I don't recommend it.)

In the meantime, https://people.mozilla.com/~jedavis/perf-static-armhf/perf is a binary I've built.  This is arguably not much of an improvement over comment #0.
Recent discovery: the CFLAGS/CXXFLAGS settings in comment #47 don't work for the Android parts of the build, which means the frame pointer chain can break in libc (among other places), which is unfortunate when the stacks start in the kernel.  I could swear this was working when I first tried it, but in any case it doesn't work now; see bug 850041.

Also, the Linaro toolchain builder doesn't quite work; there's a strange linker error which seems to be caused by the difference in binutils version.  That can probably be worked around on its own.

It's possible to "fix" this like so:

--- a/gcc-4.4.3/gcc/config/arm/bpabi.h
+++ b/gcc-4.4.3/gcc/config/arm/bpabi.h
@@ -45 +45,2 @@
-#define TARGET_DEFAULT (MASK_INTERWORK | TARGET_ENDIAN_DEFAULT)
+#define TARGET_DEFAULT (MASK_INTERWORK | TARGET_ENDIAN_DEFAULT \
+           | MASK_APCS_FRAME | MASK_THUMB2_FAKE_APCS_FRAME)

and by copying the cc1 and cc1plus binaries over the ones in the Android prebuild toolchain checkout.  This is not really a sustainable way of doing this, I think.

And there's more: there's inline assembly in the Android graphics code that clobbers r11, which GCC n(reasonably) forbids when r11 is a fixed register.  I've changed an ifdef so it uses C code instead; that or something else will have to go into gonk-patches, one way or another.

In the meantime, https://github.com/jld/b2g-toolchain-prebuilt has the compiler that doesn't quite work, and perf binaries for host and target (assuming the host is Linux on a PC), and the script that built them.
It gets worse.  There's a reason those other fields are there, and I'm not sure exactly what perf is doing but it's probably wrong.

If you're at a point after a branch-and-link but before the new frame has been committed to FP, then you need to include the current LR, after SP and before the string of saved LRs; but if not, then you need to *not* use LR, because it's either a copy of the first saved LR or (more importantly) may be of some activation that is no longer on the stack.

This is why the APCS frame saves PC, if I understand correctly: so you can compare it to the actual PC and the bounds of the enclosing function and see if there's an unsaved LR (assuming the compiler doesn't move the prologue through a branch or anything).  But we can't do that in Thumb2, at least not directly.  And it's worse than that: since the LR (r14) saved by the STMDB is loaded into PC (r15) in the LDMIA that returns, there is no register larger than 15 to use to pop the saved PC.  And this is why the otherwise uninformative saved SP is there: so the return can disable writeback of the not-incremented-enough SP and instead reload SP from the stack.  Which also can't be done in Thumb2.

Except: x86 has the same problem.  The "link register" is the top-of-stack word, and there's a window between the CALL and the write to rBP with the same problem, and I don't think any ABIs have tried to mostly-fix it the way APCS did.  So maybe we don't need to care.

Also, it would be possible at least in principle to save the probably-meaningless LR and postprocess it out with information from debug sections, but that's more of a thought for bug 810526.
I think my proof-of-concept finally works.  To try: add git://github.com/jld/B2G.git as a remote, check out the "profiling" branch from it, set B2G_PROFILING=1 in .userconfig, then config/build/flash.  This needs to rebuild everything, and it probably needs `rm -rf out` and either deleting or changing the gecko objdir to work properly.  When that's done, `./run-perf.sh record -a -g` and do some stuff, then `./run-perf.sh report` to be see what happened (and/or be confused by the perf report UI).

If this goes anything like the last few times I thought things were mostly working, something will probably break.  Supports Linux hosts only, at present — and since I'm running perf report on the host and perf is highly Linux-dependent, that may be an ongoing issue.

But if this *does* work, it's also a constructive proof that the Linaro toolchain builder can be used (with some fixes and some careful choices of configuration) for B2G.
I'm trying to build, but get

/home/kanru/zone2/mozilla/B2G/prebuilt/alt-toolchain/toolchain-4.4.3/x86_64-linux-gnu/bin/../lib/gcc/arm-linux-androideabi/4.4.3/../../../../arm-linux-androideabi/bin/ld: /usr/lib32/libstdc++.so.6: version `GLIBCXX_3.4.15' not found (required by /home/kanru/zone2/mozilla/B2G/prebuilt/alt-toolchain/toolchain-4.4.3/x86_64-linux-gnu/bin/../lib/gcc/arm-linux-androideabi/4.4.3/../../../../arm-linux-androideabi/bin/ld)
collect2: ld returned 1 exit status
Hmm, maybe my libstdc++6 is too old.
Awesome! This works for me on ubuntu 12.04 (x64)!

I had to manually push target perf binary to /cache after flashing but after I did that everything worked! I can't wait to play around with this.
Confirmed this works for me too after switch to Debian unstable (x64)

I had to push perf to the target too.

 $ adb push prebuilt/alt-toolchain/perf/arm-linux-androideabi-perf /cache/perf
Sorry about that; I thought I'd remembered to actually write the line to push the perf, or at the very least test after reflashing (which would have wiped /cache), but apparently not.  Fixed.

As for the library versions: sorry about that, too; in theory, I know too much about library/ABI versioning to forget that.  (In practice, theory and practice are different.)  It should be possible to cd into prebuilt/alt-toolchain and run ./build.sh (which will fail to rebuild the target perf at present, but that's not important here), though I haven't tried to build on an older system yet.  I'm only doing the prebuilt thing here because the binaries are *much* smaller than the source repositories.
I've successfully converted a perf data file into an approximation of SPS format, by writing a Python script to do sketchy regexp matches over the output of `perf report -D`, and it looks reasonable in Cleopatra.  (The kernel frames don't have symbols because I haven't added that to profile-symbolicate yet.)  Once it's a little more finished, this may be helpful for extracting information beyond what the perf report user interface allows for.
Better: convert an entire `perf record -a` profile into a single SPS file, one "thread" per CPU, with pseudo-frames for each process and thread, and symbols for everything — the kernel and kernel modules, PIC/ASLRed userland binaries, even userland binaries where we only have .dynsym (though they may not be entirely right in that case).  This might be useful for seeing how different processes interact; and, if not, one can focus the pseudo-frame for a process or thread of interest. 

I've made it usable and pushed it; `./run-perf.sh sps` to convert a previously captured profile (which needs to be with -a instead of -p, currently; the text dump is formatted differently), or `./run-perf.sh record-sps` to do record -a -g and convert it.  There will probably be "no symbols" warnings about a few things; they're probably harmless (as long as they're not in /system/b2g or something like libc.)

One problem I've noticed: perf doesn't record the topmost user-mode stack frame.  I think it's a one-line change to the kernel, to save the user PC before walking the FP chain, but the stack issues really need their own bug.
(In reply to Jed Davis [:jld] from comment #60)
> This might be useful for seeing how
> different processes interact; and, if not, one can focus the pseudo-frame
> for a process or thread of interest. 

This absolutely is fantastic! It's far more useful than this because it allows us also to monitor system effects and those had been very hard to investigate up to this point (and we've already been bitten by them more than once).
Whiteboard: c=performance
Status: NEW → ASSIGNED
Keywords: perf
Whiteboard: c=performance → c= p=5
Whiteboard: c= p=5 → [c=profiling p=5]
Dear all,

I tried to run the mini-perf record in nexus-4 today,by adding the jld miniperf project to the main branch,But in the target am getting the error

root@android:/ # miniperf-record -o /cache/perf.data  & perf_pid=$!
[1] 826
root@android:/ # perf_event_open: No such device


Kernel have a all perf config enabled 
$ zgrep "_PERF[_= ]" config.gz 
# CONFIG_CGROUP_PERF is not set
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_USE_VMALLOC=y
CONFIG_PERF_EVENTS=y
CONFIG_PERF_COUNTERS=y
CONFIG_DEBUG_PERF_USE_VMALLOC=y
CONFIG_HW_PERF_EVENTS=y
# CONFIG_CLS_U32_PERF is not set
CONFIG_MMC_PERF_PROFILING=y

Any idea 

Thanks
Dhyan
I have a guess about why comment #62 happened.  I looked at revision 7ce11cd of the kernel source (obtained from another report of Nexus 4 perf_event problems, not involving miniperf-record so probably unrelated), which is available from git://codeaurora.org/kernel/msm, and the only ENODEV in kernel/events/core.c is this:

                /*
                 * We could be clever and allow to attach a event to an
                 * offline CPU and activate it when the CPU comes up, but
                 * that's for later.
                 */
                if (!cpu_online(cpu))
                        return ERR_PTR(-ENODEV);

So maybe some of the CPUs aren't online.  I've run miniperf-record on an 8-core x86 host, but not on a Nexus 4 or other multicore Android device, so I think I haven't run into this because Android manages its CPUs differently.

If I'm right, then (sadly) miniperf-record is broken on nexus-4 until I can figure out the right fix for this.
(In reply to Jed Davis [:jld] from comment #63)
...
> So maybe some of the CPUs aren't online.  I've run miniperf-record on an
> 8-core x86 host, but not on a Nexus 4 or other multicore Android device, so
> I think I haven't run into this because Android manages its CPUs differently.

This could be close to the mark.  The same error is produced when requesting
a cpu that is off-line when running 'perf' on a Nexus 4.  E.g.

perf record -C 1                                       
Fatal: No such device - did you specify an out-of-range profile CPU?
(In reply to Jed Davis [:jld] from comment #63)
...
> So maybe some of the CPUs aren't online.  I've run miniperf-record on an
> 8-core x86 host, but not on a Nexus 4 or other multicore Android device, so
> I think I haven't run into this because Android manages its CPUs differently.

A possible work-around might be to enable all the CPUs:

adb shell
stop mpdecision

Then turn on all the CPUs:
echo 1 > /sys/devices/system/cpu/cpu1/online
echo 1 > /sys/devices/system/cpu/cpu2/online
echo 1 > /sys/devices/system/cpu/cpu3/online
Hi Jed & Douglas

I was also trying the cpu_online area,

and today I just tried a comparison study why perf is working in the pandboard and not in nexus-4

Both are multicore(Panda is 2 core and Nexus is 4 core) and in both default configuartion have SMP enabled  and cpu hot plug

In case of Panda I am able to switch on/off the cpu using sysfs online entry

But in case of nexus-4 i am not able to do the same.

And perf is failing in case of offline cpu.

nexus-4
=================
130|root@android:/sys/devices/system/cpu # ls -l
drwxr-xr-x root     root              2013-07-31 18:23 cpu0
drwxr-xr-x root     root              2013-07-31 18:23 cpu1
drwxr-xr-x root     root              2013-07-31 18:23 cpu2
drwxr-xr-x root     root              2013-07-31 18:23 cpu3

2.Not Able to turn on and off cpu 
=============================
130|root@android:/sys/devices/system/cpu # cat cpu1/online                     
0
root@android:/sys/devices/system/cpu # echo 1 > cpu1/online                    
root@android:/sys/devices/system/cpu # cat cpu1/online                         
0
root@android:/sys/devices/system/cpu # echo 1 > cpu2/online                    
root@android:/sys/devices/system/cpu # cat cpu2/online                         
0
Incase of panda
===============
root@android:/sys/devices/system/cpu # cat cpu1/online                         
0
root@android:/sys/devices/system/cpu # echo 1 > cpu1/online

root@android:/sys/devices/system/cpu # cat cpu1/online                         
1
(In reply to Dhyan from comment #66)
> Hi Jed & Douglas
> 
> I was also trying the cpu_online area,
...
> In case of Panda I am able to switch on/off the cpu using sysfs online entry
> 
> But in case of nexus-4 i am not able to do the same.

It works for me, but it is necessary to 'stop mpdecision' first.  You might want to 'start mpdecision' when done to save power.
Thank you Douglas !!

After stopping mpdecision i am able to turn on cpu..

(In reply to Douglas Crosher [:dougc] from comment #67)
> (In reply to Dhyan from comment #66)
> > Hi Jed & Douglas
> > 
> > I was also trying the cpu_online area,
> ...
> > In case of Panda I am able to switch on/off the cpu using sysfs online entry
> > 
> > But in case of nexus-4 i am not able to do the same.
> 
> It works for me, but it is necessary to 'stop mpdecision' first.  You might
> want to 'start mpdecision' when done to save power.
changing this to a meta bug as there is a whole tree of dependent bugs that are in our perf backlog.

as per discussion with mlee, this gets an s= value for the soonest sprint where a dependent bug is due to be delivered and will be carried forward to subsequent sprints until all dependent bugs are closed.
Summary: Get perf working on b2g → [META] Get perf working on b2g
Whiteboard: [c=profiling p=5] → [c=profiling s=2013.09.06 p=5]
Whiteboard: [c=profiling s=2013.09.06 p=5] → [c=profiling s=2013.09.06 p=0]
Whiteboard: [c=profiling s=2013.09.06 p=0] → [c=profiling s= p=0]
Keywords: perf
Whiteboard: [c=profiling s= p=0]
Whiteboard: [perf-reviewed]
Blocks: 962689
Jed, any chance you have perf working on nexus-4 or nexus-5?  I checked out your profiling branch, but it hasn't been rebased recently enough to have the nexus manifests.
Flags: needinfo?(jld)
(In reply to Ben Kelly [:bkelly] from comment #70)
> Jed, any chance you have perf working on nexus-4 or nexus-5?  I checked out
> your profiling branch, but it hasn't been rebased recently enough to have
> the nexus manifests.

No, I never got perf working on those phones.

The current vague plan, such as there is, is to drop https://android.googlesource.com/platform/external/linux-tools-perf/ into our tree and see if it works; if it does (or it's not too painful to make it work), then that takes care of all the perf tool build problems I struggled with.  I'd like to get away from the "miniperf-record" thing and integrate with upstream if possible, and that would certainly help.  (Especially if we can get the reporting commands from perf to build on OS X hosts — assuming that there's still a significant audience that uses Macs and doesn't just build in a Linux VM.  But it's not the end of the world if we can't, since we have the Gecko profiler for more common use cases now.)

And then there are the kernel patches.  Some things probably still need to be fixed upstream, but newer versions might need less backporting than the 3.0.x series I was dealing with for the ICS-based devices.  And there are some questions about how to integrate all of this into something usable.
Flags: needinfo?(jld)
While working on webrtc, profiling with perf has really been an invaluable tool. However since it wasn't supported for jb or kk devices, I went ahead and ported jld's work on the toolchain to gcc 4.6 and 4.7. It works well on a dolphin (kitkat) and StevenLee and I are ironing out issues for the Flame as well.

Instructions:
1. clone the profiling branch here: https://github.com/comoyo/b2g/tree/profiling
2. ./config.sh <device>
2.1. If there's no manifest for the device, add it yourself in B2G/profiling/manifests and modify config.sh accordingly. Check the ones already there for what to modify.
3. Make sure .userconfig has B2G_PROFILING=1, and that $TARGET_TOOLS_PREFIX is unset.
4. Clobber if you are in an existing build folder.
5. ./build.sh && ./flash.sh
6. ./run-perf.sh record-sps
7. Open in cleopatra: http://people.mozilla.org/~bgirard/cleopatra/
We actually can drop https://android.googlesource.com/platform/external/linux-tools-perf/ into our tree and it will build the tools — if the correct android-* tag for the device is checked out.  According to the makefiles it should even be able to build the host tool on Mac OS X, though I haven't tried this yet.

This still doesn't solve the problem of unwinding user-level stacks.  Chromium, for the record, is compiling with `-marm -mapcs-frame` (i.e., disabling use of Thumb/Thumb2, which didn't work so well for us, in order to use the obsolete APCS frame pointer option) — see https://code.google.com/p/chromium/wiki/ProfilingContentShellOnAndroid and https://code.google.com/p/chromium/codesearch#chromium/src/build/common.gypi&l=3612 (and judging by the comment “Thumb code with frame pointer makes chrome crash early.” someone ran into the GCC bug we worked around in bug 873332).  They're also doing that only for Chromium itself, which might explain why they avoided our problem with external/skia, but means they wouldn't be able to unwind through system libraries.

That said, I'd still like to try to get `perf (record|report) -g dwarf` to work.
Someone can try to make this happen, but at this point it probably won't be me.
Assignee: jld → nobody
Status: ASSIGNED → NEW
Closing this old B2G bug.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: