Closed Bug 926734 Opened 11 years ago Closed 6 years ago

[Flatfish][Profiling]: When starting up Gecko profiling tool, system got hang.

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: vliu, Unassigned, NeedInfo)

References

Details

(Whiteboard: [flatfish][TCP=breakage])

Attachments

(1 file, 1 obsolete file)

STR for this issue.

1. In homescreen, run |./profile.sh start| on PC
2. Wait until the tool runs finished.
   ./profile.sh start
   Stopping b2g ..... stopped.
   Removing old profile files (from device) ... done
   Starting b2g with profiling enabled ... started
3. Try to tab homescreen, the device got hang.
To clarify,
- What do you mean by "hang"? Do you see the startup animation and the lock screen? The device is responsive before you tap it (including using power and volume button)?
- Is b2g process still alive when the device hangs? If so then you may attach to it and show the backtraces of all threads to see why it freezes.
When you run |./profile.sh start|, it had the following flows.

   ./profile.sh start
   Stopping b2g ..... stopped.
   Removing old profile files (from device) ... done
   Starting b2g with profiling enabled ... started

But even done with the above flows, I don't see the startup animation and the lock screen. Checked the process status by watch-procrank.sh I can only see b2g process.
Whiteboard: [Flatfish only]
See Also: → 922548
Observed one source of crash:

SharedLibraryInfo::GetInfoForSelf() in tools/profiler/shared-libraries-linux.cc:

It scans /proc/%d/maps and take some transient maps into consideration. Those maps, or "false elfs", can disappear or change, for example, before stack unwinding or other usages.

A minor problem is that, the offsets may be larger than 4G, so the unsigned long offset can sometimes not fit.
A few thoughts:

Is Flatfish JB-based?  If so, bug 914190 could be the problem.

We should try to get the crash to happen under GDB, but from comment 2 it sounds like child processes are crashing on startup?  If so, there's an environment variable (I forget what it is, but it's on one of the wikis somewhere) to make each child process sleep for a while after it's created so a debugger can be attached.

Do we know that the problem is stale mappings from SharedLibraryInfo, or is that just a guess?
(In reply to Jed Davis [:jld] from comment #4)
> A few thoughts:
> 
> Is Flatfish JB-based?  If so, bug 914190 could be the problem.
> 

Yes, it's JB-based. The patch in bug 914190 was r- but it should work, right? I'll try it asap. BTW, the problem in comment 3 still exist no matter getline() works or not.

> We should try to get the crash to happen under GDB, but from comment 2 it
> sounds like child processes are crashing on startup?  If so, there's an
> environment variable (I forget what it is, but it's on one of the wikis
> somewhere) to make each child process sleep for a while after it's created
> so a debugger can be attached.
> 

Thanks. I tested with the following and is able to get the crashes by gdb. The crashes seemed to be very random.
./profile.sh start
or
MOZ_PROFILER_STARTUP=1 ./run-gdb.sh

> Do we know that the problem is stale mappings from SharedLibraryInfo, or is
> that just a guess?

I observed that SharedLibraryInfo contains things like
49db7000-49db8000 rwxs 800000ad000 00:0c 2147    /dev/pvrsrvkm

and the profiler tried to access addresses not in /proc/pid/maps. After removing
EHAddrSpace::Update(); in EHABIStackWalkInit()
This kind of crash vanished.
One of the crashes: (segfault on accessing |container|, which is NULL)

#0  nsLayoutUtils::FontSizeInflationFor (aFrame=<optimized out>) at ../../../gecko/layout/base/nsLayoutUtils.cpp:5418
#1  0x40f546b8 in FontSizeInflationListMarginAdjustment (aFrame=0x480c7b58)
    at ../../../gecko/layout/generic/nsHTMLReflowState.cpp:101
#2  nsCSSOffsetState::ComputeMargin (this=0xbea2d148, aHorizontalPercentBasis=<optimized out>, aVerticalPercentBasis=76800)
    at ../../../gecko/layout/generic/nsHTMLReflowState.cpp:2487
#3  0x40f54888 in nsCSSOffsetState::InitOffsets (this=<optimized out>, aHorizontalPercentBasis=76800, aVerticalPercentBasis=
    76800, aFrameType=0x43c55ce0, aBorder=0x0, aPadding=0x0) at ../../../gecko/layout/generic/nsHTMLReflowState.cpp:2166
#4  0x40f54cf6 in nsHTMLReflowState::InitConstraints (this=0xbea2d148, aPresContext=0x4747e400, aContainingBlockWidth=76800, 
    aContainingBlockHeight=43800, aBorder=0x0, aPadding=0x0, aFrameType=0x43c55ce0)
    at ../../../gecko/layout/generic/nsHTMLReflowState.cpp:1928
#5  0x40f556c2 in nsHTMLReflowState::Init (this=0xbea2d148, aPresContext=0x4747e400, aContainingBlockWidth=76800, 
    aContainingBlockHeight=43800, aBorder=0x0, aPadding=0x0) at ../../../gecko/layout/generic/nsHTMLReflowState.cpp:319
#6  0x40f55968 in nsHTMLReflowState::nsHTMLReflowState (this=0xbea2d148, aPresContext=0x4747e400, aParentReflowState=..., 
    aFrame=<optimized out>, aAvailableSpace=..., aContainingBlockWidth=76800, aContainingBlockHeight=43800, aFlags=0)
    at ../../../gecko/layout/generic/nsHTMLReflowState.cpp:190
#7  0x40f31822 in nsAbsoluteContainingBlock::ReflowAbsoluteFrame (this=<optimized out>, aDelegatingFrame=0x4809fdc0, 
    aPresContext=0x4747e400, aReflowState=..., aContainingBlock=..., aConstrainHeight=true, aKidFrame=0x480c7b58, aStatus=
    @0xbea2d240, aOverflowAreas=0xbea2d600) at ../../../gecko/layout/generic/nsAbsoluteContainingBlock.cpp:390
#8  0x40f31af4 in nsAbsoluteContainingBlock::Reflow (this=0x479ffef8, aDelegatingFrame=0x4809fdc0, aPresContext=0x4747e400, 
    aReflowState=..., aReflowStatus=@0xbea2d3dc, aContainingBlock=..., aConstrainHeight=true, aCBWidthChanged=true, 
    aCBHeightChanged=true, aOverflowAreas=0xbea2d600) at ../../../gecko/layout/generic/nsAbsoluteContainingBlock.cpp:137
#9  0x40f3803e in nsBlockFrame::Reflow (this=0x4809fdc0, aPresContext=0x4747e400, aMetrics=..., aReflowState=<optimized out>, 
    aStatus=@0xbea2d718) at ../../../gecko/layout/generic/nsBlockFrame.cpp:1165
#10 0x40f31894 in nsAbsoluteContainingBlock::ReflowAbsoluteFrame (this=<optimized out>, aDelegatingFrame=<optimized out>, 
    aPresContext=0x4747e400, aReflowState=..., aContainingBlock=..., aConstrainHeight=true, aKidFrame=0x4809fdc0, aStatus=
    @0xbea2d718, aOverflowAreas=0xbea2dbc0) at ../../../gecko/layout/generic/nsAbsoluteContainingBlock.cpp:415
#11 0x40f31af4 in nsAbsoluteContainingBlock::Reflow (this=0x479ffea0, aDelegatingFrame=0x4805be88, aPresContext=0x4747e400, 
    aReflowState=..., aReflowStatus=@0xbea2d8b4, aContainingBlock=..., aConstrainHeight=true, aCBWidthChanged=true, 
    aCBHeightChanged=true, aOverflowAreas=0xbea2dbc0) at ../../../gecko/layout/generic/nsAbsoluteContainingBlock.cpp:137
#12 0x40f3803e in nsBlockFrame::Reflow (this=0x4805be88, aPresContext=0x4747e400, aMetrics=..., aReflowState=<optimized out>, 
    aStatus=@0xbea2da7c) at ../../../gecko/layout/generic/nsBlockFrame.cpp:1165
#13 0x40f3d9a4 in nsContainerFrame::ReflowChild (this=<optimized out>, aKidFrame=0x4805be88, aPresContext=0x4747e400, 
    aDesiredSize=..., aReflowState=..., aX=0, aY=0, aFlags=3, aStatus=@0xbea2da7c, aTracker=0x0)
    at ../../../gecko/layout/generic/nsContainerFrame.cpp:961
#14 0x40f4d6e4 in nsHTMLScrollFrame::ReflowScrolledFrame (this=0x4805bbd8, aState=0xbea2dc9c, aAssumeHScroll=<optimized out>, 
    aAssumeVScroll=<optimized out>, aMetrics=0xbea2db94, aFirstPass=true)
    at ../../../gecko/layout/generic/nsGfxScrollFrame.cpp:441
Attached patch bug926734-unwind-no-rwx-hg0.diff (obsolete) — Splinter Review
Does this patch help?  It should exclude the writable/shared mappings from the graphics drivers.  I don't understand why they need to be executable, but I assume we can't easily change that.

The right solution is probably to have a real dl_iterate_phdr on b2g instead of walking the maps; bug 827846 was looking into this, but I don't know what its current status is.
Flags: needinfo?(thuang)
Partially. There are still other kinds of crashes remaining. Here are the files after filtering:

8000-23000 r-xp 0 /system/b2g/b2g
4007c000-4008a000 r-xp 0 /system/bin/linker
40097000-400d9000 r-xp 0 /system/b2g/libmozglue.so
400df000-400e2000 r-xp 0 /system/lib/liblog.so
400e4000-40129000 r-xp 0 /system/lib/libc.so
40139000-4013a000 r-xp 0 /system/lib/libstdc++.so
4013c000-40151000 r-xp 0 /system/lib/libm.so
40153000-4015d000 r-xp 0 /system/lib/libui.so
4015f000-4016c000 r-xp 0 /system/lib/libcutils.so
4017d000-4017e000 r-xp 0 /system/lib/libhardware.so
40180000-40181000 r-xp 0 /system/lib/libsync.so
40183000-4019b000 r-xp 0 /system/lib/libutils.so
4019e000-401a1000 r-xp 0 /system/lib/libcorkscrew.so
401a3000-401a7000 r-xp 0 /system/lib/libgccdemangle.so
401aa000-401c0000 r-xp 0 /system/lib/libz.so
401c2000-401ff000 r-xp 0 /system/lib/libEGL.so
40207000-40247000 r-xp 0 /system/lib/libGLES_trace.so
40249000-4027d000 r-xp 0 /system/lib/libstlport.so
40281000-40285000 r-xp 0 /system/lib/libhardware_legacy.so
40288000-4028a000 r-xp 0 /system/lib/libwpa_client.so
4028c000-40291000 r-xp 0 /system/lib/libnetutils.so
40293000-402c7000 r-xp 0 /system/lib/libgui.so
402d3000-402f1000 r-xp 0 /system/lib/libbinder.so
402f7000-402fc000 r-xp 0 /system/lib/libGLESv2.so
402fe000-40300000 r-xp 0 /system/lib/libsuspend.so
4040b000-40454000 r-xp 0 /system/lib/libdbus.so
40457000-4046d000 r-xp 0 /system/lib/libcamera_client.so
40474000-40488000 r-xp 0 /system/lib/libdrmframework.so
4048c000-404a0000 r-xp 0 /system/lib/libexpat.so
404a3000-404a7000 r-xp 0 /system/lib/libgabi++.so
404a9000-404f7000 r-xp 0 /system/lib/libsonivox.so
408fe000-40a93000 r-xp 0 /system/b2g/libnss3.so
40aa4000-421ea000 r-xp 0 /system/b2g/libxul.so
42590000-42676000 r-xp 0 /system/lib/libstagefright.so
4267e000-42744000 r-xp 0 /system/lib/libcrypto.so
42759000-4286a000 r-xp 0 /system/lib/libicui18n.so
42872000-4295d000 r-xp 0 /system/lib/libicuuc.so
4296b000-429d2000 r-xp 0 /system/lib/libmedia.so
429e7000-429f2000 r-xp 0 /system/lib/libstagefright_foundation.so
429f4000-429f6000 r-xp 0 /system/lib/libaudioutils.so
429f8000-429fb000 r-xp 0 /system/lib/libspeexresampler.so
429fd000-429fe000 r-xp 0 /system/lib/libmedia_native.so
42a00000-42a32000 r-xp 0 /system/lib/libssl.so
42a38000-42a4a000 r-xp 0 /system/lib/libstagefright_omx.so
42a4d000-42a4f000 r-xp 0 /system/lib/libstagefright_yuv.so
42a51000-42a69000 r-xp 0 /system/lib/libvorbisidec.so
42a6b000-42a6c000 r-xp 0 /system/lib/libstagefright_enc_common.so
42a6e000-42a73000 r-xp 0 /system/lib/libstagefright_avc_common.so
42a75000-42a79000 r-xp 0 /system/lib/libsysutils.so
42a7c000-42a94000 r-xp 0 /system/lib/libsensorservice.so
42a99000-42aa2000 r-xp 0 /system/vendor/lib/hw/gralloc.flatfish.so
42aa4000-42ae2000 r-xp 0 /system/vendor/lib/libsrv_um.so
42ae5000-42aea000 r-xp 0 /system/vendor/lib/libpvr2d.so
42bf0000-42bf6000 r-xp 0 /system/vendor/lib/hw/hwcomposer.flatfish.so
42cf8000-42cf9000 r-xp 0 /system/lib/hw/power.default.so
ffff0000-ffff1000 r-xp 0 [vectors]

I added a check to remove [vectors] but the problem remains.
Flags: needinfo?(thuang)
The [vectors] entry shouldn't be a problem.  Are the remaining crashes in the stack unwinder, or somewhere else?  The crash in comment #6 appears to be unrelated to profiling.
Flags: needinfo?(thuang)
(In reply to Jed Davis [:jld] from comment #9)
> Are the remaining crashes in the stack unwinder, or somewhere else?

I don't known, yet. It's rather random, although none are in stack unwinder.

> The crash in comment #6 appears to be unrelated to profiling.

It only happens when profiling is enabled. I have been trying to narrow the problem down and it seems that even replacing the contents of the signal handler with an empty, time consuming loop the problem remains. I'm wondering if this is related to some kernel problem but have yet to verify other parts before sending SIGPROF. Any ideas?
Flags: needinfo?(thuang)
Does kernel need to enable any special functions to support gecko profile?
(In reply to thomas tsai from comment #11)
> Does kernel need to enable any special functions to support gecko profile?

AFAIK, No. On Linux, SPS is implemented with POSIX signal.
I saw several crashes in JS parser. The logs give me more confidence that JS parser felt something wrong. Combining with the symptoms observed in bug 922548, there are very likely some memory corruptions. I'm trying to run valgrind but blocked by the linker; The linker crashed immediately when being executed by valgrind.

Thomas, are there any special patches to bionic in this platform? Could you help run valgrind on it?
Flags: needinfo?(ttsai)
bionic is from aosp 4.2.2. There is no special patches for bionic. You can refer to "https://bitbucket.org/thomastsai/a31-b2g-manifest/src/6a8765e609490b79966f212dfcc6dbbd3bd1331a/base-jb.xml?at=aosp-4.2.2"
valgrind ever run to debug graphic performance couple weeks ago.
I will arrange one to run valgrind again.
Flags: needinfo?(ttsai)
Flags: needinfo?(dliang)
Comment on attachment 829579 [details] [diff] [review]
bug926734-unwind-no-rwx-hg0.diff

I filed bug 940167 for this.  If you need the patch on 1.2, I think the bug needs to be flagged koi? and go through triage for that.
Attachment #829579 - Attachment is obsolete: true
Can anyone confirm that this fixes the issue for Flatfish profiling?
Flags: needinfo?(vliu)
Ting-Yuan and dliang is still working on this issue. As Ting-Yuan said in Comment 13, they still focus on running valgrind to see the problem.
Flags: needinfo?(vliu)
hi Thomas,

this issue seems important for developer release, i intend to change it to 1.3+. do you have any concern?
Flags: needinfo?(ttsai)
Now, we can use valgrind to debug B2G successfully. 
Below is the SW information on valgrind:
Valgrind: 3.9.0 ( download here: http://valgrind.org/downloads/valgrind-3.9.0.tar.bz2)
Android-NDK: R9b (download here: http://dl.google.com/android/ndk/android-ndk-r9b-linux-x86_64.tar.bz2)
We will keep post if any further update.
Flags: needinfo?(dliang)
Flags: needinfo?(ttsai)
I've been looking more into this. The remaining crash seems to be strongly linked with the imgtec driver. Here's the information I have:
- Signaling the process doesn't cause the crash. (interrupted the sys calls the imgtec driver may of not been handled properly).
- Writing to a large heap allocated block (the profiler's circular buffer) is enough to trigger the crash. The crash seems to happen in a minimal signal handler if-and-only-if I touch this circular buffer.
- I tried playing and scribbling the popped stack in case imgtec was returning memory from the stack but that wasn't the case. It didn't make a difference.

I'm not sure how the imgtec driver could be causing this. There's 'gralloc_unregister_buffer' error. I tried patching GraphicsBuffer to log the alloc/free and they all seemed pair. Maybe these errors should be fixed before we continue to investigate this problem.

Vincent or Jerry do you see the gralloc error with flatfish? I see them without the profiler running.
Flags: needinfo?(vlin)
Flags: needinfo?(hshih)
I have thest message without enable profiler.
Are these error message the same as yours?

Vincent will ask the imgtec for this error message.


E/IMGSRV  ( 1273): :0: gralloc_device_alloc: Framebuffer/bypass usage bits are incompatible with non-GPU-renderable pixel format (1)
W/GraphicBufferAllocator( 1273): WOW! gralloc alloc failed, waiting for pending frees!
E/IMGSRV  ( 1273): :0: gralloc_device_alloc: Framebuffer/bypass usage bits are incompatible with non-GPU-renderable pixel format (1)
E/IMGSRV  ( 1273): :0: gralloc_unregister_buffer: Cannot unregister unregistered buffer (ID=9)
E/IMGSRV  ( 1273): :0: gralloc_unregister_buffer: Cannot unregister unregistered buffer (ID=10)
E/IMGSRV  ( 1273): :0: gralloc_unregister_buffer: Cannot unregister unregistered buffer (ID=16)
Flags: needinfo?(vlin)
Flags: needinfo?(hshih)
(In reply to Benoit Girard (:BenWa) from comment #20)
> I've been looking more into this. The remaining crash seems to be strongly
> linked with the imgtec driver. Here's the information I have:
> - Signaling the process doesn't cause the crash. (interrupted the sys calls
> the imgtec driver may of not been handled properly).
> - Writing to a large heap allocated block (the profiler's circular buffer)
> is enough to trigger the crash. The crash seems to happen in a minimal
> signal handler if-and-only-if I touch this circular buffer.
> - I tried playing and scribbling the popped stack in case imgtec was
> returning memory from the stack but that wasn't the case. It didn't make a
> difference.
> 
> I'm not sure how the imgtec driver could be causing this. There's
> 'gralloc_unregister_buffer' error. I tried patching GraphicsBuffer to log
> the alloc/free and they all seemed pair. Maybe these errors should be fixed
> before we continue to investigate this problem.
> 
> Vincent or Jerry do you see the gralloc error with flatfish? I see them
> without the profiler running.

I've also seen these error messages for a while, but alloc/free in Gecko is pair.

Terry:
We need your help to debug graphics hal.
Flags: needinfo?(phterry)
Depends on: 953018
I spun that off to 953018 in case the issues are not related.
Among the strange crashes, I focused on the one in inflate_fast(), and found that the value of variable |here| after assignment does not equal to the value of right hand side, which makes |bits| underflow after minus |op| as |op| is incorrectly larger than |bits|. Note the assignment goes to memcpy(), which uses Neon instructions.

After add |ARCH_ARM_USE_NON_NEON_MEMCPY := true| to device/allwinner/common/BoardConfigCommon.mk, I haven't ran into strange crashes or javascript errors.

I still don't know how is this happened, but ThreadProfile::addTag() also relies on memcpy() for the |mEntries| assignment.
(In reply to Ting-Yu Chou [:ting] from comment #24)
> After add |ARCH_ARM_USE_NON_NEON_MEMCPY := true| to
> device/allwinner/common/BoardConfigCommon.mk, I haven't ran into strange
> crashes or javascript errors.

Scratch this, I just ran into crashes even the flag is set.
So I added memcmp() after assignment in inflate_fast() like following, SIGTRAP would be raised randomly:

   here = dcode[hold & dmask];
   if (memcmp(&here, &dcode[hold & lmask], sizeof(code)))
       raise(SIGTRAP);

I googled around, couldn't find that whether bionic's memcpy() is async-signal-safe.
Modified bionic memcpy.S a bit to compare the result after 4 bytes copy:

  diff --git a/libc/arch-arm/bionic/memcpy.S b/libc/arch-arm/bionic/memcpy.S
  index 8453cc0..f1ad5f9 100644
  --- a/libc/arch-arm/bionic/memcpy.S
  +++ b/libc/arch-arm/bionic/memcpy.S
  @@ -154,6 +154,11 @@ ENTRY(memcpy)
   1:      bge         2f
           vld4.8      {d0[0], d1[0], d2[0], d3[0]}, [r1]!
           vst4.8      {d0[0], d1[0], d2[0], d3[0]}, [r0]!
  +        ldr         r3, [r1, #-4]
  +        ldr         ip, [r0, #-4]
  +        cmp         ip, r3
  +        beq         2f
  +        bkpt
   2:      movs        ip, r2, lsl #31
           ldrmib      r3, [r1], #1
           ldrcsb      ip, [r1], #1
  @@ -169,7 +174,6 @@ END(memcpy)
   
   #else   /* __ARM_ARCH__ < 7 */

Run b2g with command |$ MOZ_PROFILER_STARTUP=1 ./run-gdb.sh| and get the breakpoint hit. Note the content of d0[0], d1[0], d2[0] and d3[0] are not matched to src, which is why dst has different result from src:

  Program received signal SIGBUS, Bus error.
  memcpy () at bionic/libc/arch-arm/bionic/memcpy.S:161
  161	        bkpt
  (gdb) bt
  #0  memcpy () at bionic/libc/arch-arm/bionic/memcpy.S:161
  #1  0x41a9a8c0 in MOZ_Z_inflate_fast (strm=0x41a9a8c1, start=<optimized out>)
      at ../../../../gecko/modules/zlib/src/inffast.c:167
  #2  0x0000950c in ?? ()
  #3  0x0000950c in ?? ()
  Backtrace stopped: previous frame identical to this frame (corrupt stack?)
  (gdb) x/4xb $r1-4
  0x436635d8:	0x12	0x04	0x09	0x00
  (gdb) x/4xb $r0-4
  0xbe9f6c3c:	0x75	0x93	0x1e	0x43
  (gdb) p/x $d0
  $1 = {u8 = {0x75, 0x61, 0x76, 0x61, 0x69, 0x6c, 0x5f, 0x6f}, u16 = {0x6175, 
      0x6176, 0x6c69, 0x6f5f}, u32 = {0x61766175, 0x6f5f6c69}, u64 = 
      0x6f5f6c6961766175, f32 = {0x0, 0x0}, f64 = 0x8000000000000000}
  (gdb) p/x $d1
  $2 = {u8 = {0x93, 0x45, 0x53, 0x54, 0x4f, 0x52, 0x45, 0x20}, u16 = {0x4593, 
      0x5453, 0x524f, 0x2045}, u32 = {0x54534593, 0x2045524f}, u64 = 
      0x2045524f54534593, f32 = {0x164c0000, 0x0}, f64 = 0x0}
  (gdb) p/x $d2
  $3 = {u8 = {0x1e, 0x65, 0x31, 0x2d, 0x61, 0x63, 0x33, 0x62}, u16 = {0x651e, 
      0x2d31, 0x6361, 0x6233}, u32 = {0x2d31651e, 0x62336361}, u64 = 
      0x623363612d31651e, f32 = {0x0, 0x0}, f64 = 0x8000000000000000}
  (gdb) p/x $d3
  $4 = {u8 = {0x43, 0x33, 0x37, 0x34, 0x66, 0x36, 0x38, 0x36}, u16 = {0x3343, 
      0x3437, 0x3666, 0x3638}, u32 = {0x34373343, 0x36383666}, u64 = 
      0x3638366634373343, f32 = {0x0, 0x0}, f64 = 0x0}
  (gdb) 

I am trying to understand how to prevent compiler from replacing assignment by memcpy.
"-O0" doesn't help, ThreadProfile::addTag() still uses memcpy() for the ProfileEntry assignment.
What is the memcpy transformation? Is it a single assignment or optimizing multiple assignments? If so you can try to use the memory barrier stuff which will prevent optimizations across that boundary.
Is is a class instance assignment:

  void ThreadProfile::addTag(ProfileEntry aTag)
  {
    // Called from signal, call only reentrant functions
    mEntries[mWritePos] = aTag;

Disassembly:

  $ ./prebuilts/gcc/linux-x86/arm/arm-linux-androideabi-4.6/bin/arm-linux-androideabi-objdump -S ./objdir-gecko/tools/profiler/ProfileEntry.o

  Disassembly of section .text._ZN13ThreadProfile6addTagE12ProfileEntry:

  00000000 <_ZN13ThreadProfile6addTagE12ProfileEntry>:
     0:   b513            push    {r0, r1, r4, lr}
     2:   4604            mov     r4, r0
     4:   e88d 0006       stmia.w sp, {r1, r2}
     8:   2205            movs    r2, #5
     a:   6861            ldr     r1, [r4, #4]
     c:   6880            ldr     r0, [r0, #8]
     e:   fb02 1000       mla     r0, r2, r0, r1
    12:   4669            mov     r1, sp
    14:   f7ff fffe       bl      0 <memcpy>
Overload operator=?
That works. But I found not only ProfilerEntry::addTag() calls memcpy from the objects under profiler folder. Besides, maybe compiler puts some other non-async-signal-safe functions as well.

I was wondering is it possible to instruct compiler not to use memcpy in default assignment operator, but I couldn't find such an option.
(In reply to Ting-Yu Chou [:ting] from comment #32)
> That works. But I found not only ProfilerEntry::addTag() calls memcpy from
> the objects under profiler folder. Besides, maybe compiler puts some other
> non-async-signal-safe functions as well.
> 
> I was wondering is it possible to instruct compiler not to use memcpy in
> default assignment operator, but I couldn't find such an option.

gcc don't always use memcpy for assignment. It depends on your object's size.
You could use -fno-builtin to disable gcc to use builtin function(in this case, memcpy).
Or maybe you can use #pragam with push, pop to just set one function scope with your customized option.
Here are the symbols I found from ojbects which may be signal unsafe. The *-prefixed function means it is not invoked explicitly from source or header, i.e., may be inserted by compiler.

Lul*.o
ns*.o
platform.o
UnwinderThread2.o
local_debug_info_symbolizer.o
  Seems the signal handler does not touch those.

AutoObjectMapper.o
  munmap,
  *memcpy,
  snprintf,
  mmap

BreakpadSampler.o
  strlen

EHABIStackWalk.o
  *puts,
  *memcpy,
  *memmove,
  memcmp,

JSStreamWriter.o
  strlen,
  snprintf

platform-linux.o
  strlen,
  strtok_r,
  malloc,
  strncpy,
  stncmp,
  free,
  *memmove,
  *memcpy,

ProfileEntry.o
  *memset,
  *memcpy,
  free,
  strdup,
  snprintf,
  *strlen,

ProfilerMarkers.o
  free,
  strdup,

shared-libraries-linux.o
  *memset,
  *memcpy,
  *free,
  snprintf,
  sscanf,
  strchr,
  strcmp,
  *strlen,
  *malloc

TableTicker.o
  memset,
  memcpy,
  strlen,
  memset,
  strcmp
I will check the toolchain tomorrow. 4 bytes structure assignment transforms to bionic's memcpy invocation is strange to me, it seems library functions are used excessively.
(In reply to Ting-Yu Chou [:ting] from comment #27)
>            vld4.8      {d0[0], d1[0], d2[0], d3[0]}, [r1]!
>            vst4.8      {d0[0], d1[0], d2[0], d3[0]}, [r0]!

What happens if these instructions are replaced with ldr/str?  This looks like it's just a 4-byte copy — and maybe there's a bug with how these particular instructions are implemented, or how they're restarted after being interrupted, or something like that.

I don't see why that sequence would be async signal unsafe unless there's a hardware bug or kernel bug.

>   (gdb) x/4xb $r1-4
>   0x436635d8:	0x12	0x04	0x09	0x00
>   (gdb) x/4xb $r0-4
>   0xbe9f6c3c:	0x75	0x93	0x1e	0x43

Is 0x431e9375 the address of anything interesting?  (`info symbol 0x431e9375`, or look at nearby memory)

It looks as if d0 through d3 are ASCII text except for the first byte of each:

>   (gdb) p/x $d0
>   $1 = {u8 = {0x75, 0x61, 0x76, 0x61, 0x69, 0x6c, 0x5f, 0x6f}

"u" "avail_o" — maybe from "avail_out"?

>   (gdb) p/x $d1
>   $2 = {u8 = {0x93, 0x45, 0x53, 0x54, 0x4f, 0x52, 0x45, 0x20}

"\x93" "ESTORE " — "RESTORE " something.

>   (gdb) p/x $d2
>   $3 = {u8 = {0x1e, 0x65, 0x31, 0x2d, 0x61, 0x63, 0x33, 0x62}

"\x1e" "e1-ac3b" — from 3c2e2abc-06d4-11e1-ac3b-374f68613e61, a UUID that occurs in various places in the source.

>   (gdb) p/x $d3
>   $4 = {u8 = {0x43, 0x33, 0x37, 0x34, 0x66, 0x36, 0x38, 0x36},

"C" "374f686" — also from 3c2e2abc-06d4-11e1-ac3b-374f68613e61.
(In reply to Jed Davis [:jld] from comment #36)
> What happens if these instructions are replaced with ldr/str?  This looks
> like it's just a 4-byte copy — and maybe there's a bug with how these
> particular instructions are implemented, or how they're restarted after
> being interrupted, or something like that.
> 
> I don't see why that sequence would be async signal unsafe unless there's a
> hardware bug or kernel bug.

Need a hardware debugger to see how Neon registers are changed. Also, according to https://www.kernel.org/doc/Documentation/arm/kernel_mode_neon.txt, it seems kernel does not always preserve/restore Neon registers. I'd like to confirm this from code, but we don't have the source for flatfish.

> Is 0x431e9375 the address of anything interesting?  (`info symbol
> 0x431e9375`, or look at nearby memory)

Tried another run, $r1 gets a different value 0x45dc1a7a which matches to no symbol and the memory content around that address are all 0x00.
If the kernel clobber them then were SOL so let's assume that's not the case and we just have to make sure the sample signal doesn't clobber them. Then why don't we just push and pop them ourselves on the signal entry and exit?
(In reply to Benoit Girard (:BenWa) (Off until June 30th) from comment #38)
> why don't we just push and pop them ourselves on the signal entry and exit?

Add vpush/vpop on the signal entry and exit can still repro. The calling convention states only q4-q7 needs to be preserved, I tried both {q0-q7} and {q4-q7}.
(In reply to Ting-Yu Chou [:ting] from comment #37)
> Also, according to
> https://www.kernel.org/doc/Documentation/arm/kernel_mode_neon.txt, it seems
> kernel does not always preserve/restore Neon registers.

I think that document is referring to using those instructions in the kernel itself, not in user mode.
Attached file vfp-sigtest.diff
I found there's a bug that VFP/NEON state is not preserved around signal handlers:

  http://thread.gmane.org/gmane.linux.ports.arm.kernel/58722/focus=58724

which has been fixed already:

  https://android.googlesource.com/kernel/common.git/+/82c6f5a5b3e91ef4d2fb8725de4b8cf7affd4d61
  https://android.googlesource.com/kernel/common.git/+/5c5cac63851f347d8308d69f1892c4af51d7c1a4

I tried the test code below on flatfish, and can repro "d0 corrupted.":

  https://bugs.launchpad.net/ubuntu/+source/linux-fsl-imx51/+bug/507503/comments/8

Danny, could you please check with partner is the patch included? If not, can they take it? Thanks.
Flags: needinfo?(dliang)
(In reply to Ting-Yu Chou [:ting] from comment #41)
> Created attachment 8447868 [details]
> vfp-sigtest.diff
> 
> I found there's a bug that VFP/NEON state is not preserved around signal
> handlers:
> 
>   http://thread.gmane.org/gmane.linux.ports.arm.kernel/58722/focus=58724
> 
> which has been fixed already:
> 
>  
> https://android.googlesource.com/kernel/common.git/+/
> 82c6f5a5b3e91ef4d2fb8725de4b8cf7affd4d61
>  
> https://android.googlesource.com/kernel/common.git/+/
> 5c5cac63851f347d8308d69f1892c4af51d7c1a4
> 
> I tried the test code below on flatfish, and can repro "d0 corrupted.":
> 
>  
> https://bugs.launchpad.net/ubuntu/+source/linux-fsl-imx51/+bug/507503/
> comments/8
> 
> Danny, could you please check with partner is the patch included? If not,
> can they take it? Thanks.

Hi Jeff,

Can you please help me to check this part? thanks.
Flags: needinfo?(jeff.cy.chuang)
Flags: needinfo?(jeff.cy.chuang)
After checked the kernel sources of flatfish, the patches have been applied and the option CONFIG_VFP is enabled.
Jeff, thanks for the information. Then could you please use attachment 8447868 [details] to reproduce and check how is d0 altered?

The basic idea of the test code can be found from https://bugs.launchpad.net/ubuntu/+source/linux-fsl-imx51/+bug/507503/. Please let me know if there's anything else I should provide.
NI partner.
Flags: needinfo?(jeff.cy.chuang)
Ting, could you provide the following information:
1.Where should I apply the patch to?
2.What is d0? How could I check d0 is altered?
Thank you
(In reply to Jeff Chuang from comment #46)
> 1.Where should I apply the patch to?

B2G/gonk-misc$ touch vfp-sigtest.s
B2G/gonk-misc$ git apply ./vfp-sigtest.diff

After build and download the image, execute vfp-sigtest under adb shell you shall see:

# vfp-sigtest
Waiting for d0 to be corrupted...
d0 corrupted.

> 2.What is d0? How could I check d0 is altered?

d0 is one of Neon register, please refer to http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/ch01s03s02.html for details.

Look at vfp-sigtest.s, d0 is modified to float2 in alarm_handler(), but the patch (comment 41) should restore d0 so it is still float1 in main() and does not jump to error. If you have a hardware debugger, simply watch d0, see if kernel restores it correctly to float1. If you don't, maybe set a breakpoint in kernel to verify d0 is restored.
Flags: needinfo?(jeff.cy.chuang)
Flags: needinfo?(dliang)
Ting, Thanks for the information.
I can reproduce it and I will try to setup Kgdb for tracing d0 values.
May I have your suggestions on where should I set the breakpoints for d0 operations?
Could you tell me why d0 restoring is related to gecko profiler?
(In reply to Jeff Chuang from comment #48)
> May I have your suggestions on where should I set the breakpoints for d0
> operations?

The patch should be a good place to start, preserve_vfp_context() and restore_vfp_context().

> Could you tell me why d0 restoring is related to gecko profiler?

Since gecko profiler is done mainly in a signal handler, if Neon registers are not restored correctly, the thread it interrupts will be resumed with incorrect data.
According to ARM APCS, d0-d7 do not need to be preserved, while d8-d15 must be preserved.

How about changing the test target to d8 and give a try?
I tried on d8 by modified the vfp-test sources and the corrupted situation still happened.
We'd better check first about what have been defined and supposed to be done in the kernel for signal handling, especially about how the VFP/Neon registers should be "taken care of" in that case.
(In reply to William Liang from comment #50)
> According to ARM APCS, d0-d7 do not need to be preserved, while d8-d15 must
> be preserved.

That standard applies to procedure calls.  An asynchronous signal can be delivered between any two instructions; therefore, signal handling must restore *all* registers when returning from a signal handler.

(In reply to William Liang from comment #52)
> We'd better check first about what have been defined and supposed to be done
> in the kernel for signal handling, especially about how the VFP/Neon
> registers should be "taken care of" in that case.

For what it's worth, the mainstream Linux kernel has the patches referenced in comment #41, committed in 2010:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5c5cac63851f347d8308d69f1892c4af51d7c1a4
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=82c6f5a5b3e91ef4d2fb8725de4b8cf7affd4d61
Great! Thanks a lot for the valuable information.
I can't get kgdb working on flatfish. I got the messages "The program is not being run" when I let kernel run with the 'c' character. I will try with "printf" instead to check whether the preserve/restore functions are invoking.
Jeff, have you found anything about the issue? Thanks.
Flags: needinfo?(jeff.cy.chuang)
I am still working on it and the results are very strange. printk seems not a good way to show the values of the registers because I always get different values in preserve/restore functions.
Some sample is as follows:
[  219.925949] ===>L192:preserve_vfp_context:ufp->fpregs[0]=400a8ac1400a7426, ufp->fpregs[8]=402848fc40285de4
[  219.946532] ===>L629:vfp_restore_user_hwstate:ufp->fpregs[0]=20646d63657270, ufp->fpregs[8]=0
[  219.956632] ===>L192:preserve_vfp_context:ufp->fpregs[0]=20646d63657270, ufp->fpregs[8]=0
[  219.966353] ===>L629:vfp_restore_user_hwstate:ufp->fpregs[0]=0, ufp->fpregs[8]=0
I am looking for the better way to check the register contents.
Flags: needinfo?(jeff.cy.chuang)
Whiteboard: [Flatfish only] → [flatfish]
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE
Sorry, we closed the bug accidentally during triage meeting.
Reopen it.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Whiteboard: [flatfish] → [flatfish][TCP=breakage]
FYI I've seen something similar on my Hamachi but I'm not sure if it's the same problem. Starting profiling on a single application with |./profile.sh start -p <pid>| works but restarting the entire system (i.e. |./profile.sh start|) seems to stop somewhere during the startup process. b2g-info reports only the main b2g process as running (and actively consuming CPU time) but the Nuwa process doesn't get spawned.
Blocks: flatfish
Firefox OS is not being worked on
Status: REOPENED → RESOLVED
Closed: 10 years ago6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: