Closed Bug 1049866 Opened 10 years ago Closed 6 years ago

FxOS trunk on Valgrind hard-crashes Flame

Categories

(Firefox OS Graveyard :: Stability, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jseward, Unassigned)

Attachments

(2 files)

FxOS trunk builds and runs fine on Flame.  However, when the same
build is run on Valgrind on the Flame, the device crashes hard and
can only be recovered by power cycling.  Despite considerable
investigation into why this happens, I do not have much information
to go on.

Investigation shows that the crash is GPU related.  It happens late in
startup, in CompositorOGL::Initialize(), in
gecko/gfx/layers/opengl/CompositorOGL.cpp.  My impression, from
studying Valgrind internal debug logs, is that Gecko uses the
GPU-specific (Adreno) driver to compile a shader, and hands the generated
code to the kernel driver so as to get it to run on the GPU.  The
phone crashes at that point.

Logcat looks like this:

  I/Gecko   ( 1160): Attempting load of libEGL.so
  D/libEGL  ( 1160): loaded /vendor/lib/egl/libEGL_adreno.so
  D/libEGL  ( 1160): loaded /vendor/lib/egl/libGLESv1_CM_adreno.so
  D/libEGL  ( 1160): loaded /vendor/lib/egl/libGLESv2_adreno.so
  I/Adreno-EGL( 1160): <qeglDrvAPI_eglInitialize:381>: EGL 1.4 QUALCOMM build:  ()
  I/Adreno-EGL( 1160): OpenGL ES Shader Compiler Version: 20.00.08
  I/Adreno-EGL( 1160): Build Date: 03/15/14 Sat
  I/Adreno-EGL( 1160): Local Branch: 
  I/Adreno-EGL( 1160): Remote Branch: 
  I/Adreno-EGL( 1160): Local Patches: 
  I/Adreno-EGL( 1160): Reconstruct Branch: 
  I/HWComposer( 1160): Creating new instance

The log ends at this point.

Very occasionally (once every 20 runs?), the phone does not crash.
Instead the logcat continues like this:

  W/Adreno-GSL( 1160): <gsl_ldd_control:397>: ioctl fd 108 code 0xc0140910 (IOCTL_KGSL_RINGBUFFER_ISSUEIBCMDS) failed: errno 71 Protocol error
  W/Adreno-GSL( 1160): <log_gpu_snapshot:312>: panel.gpuSnapshotPath is not set.not generating user snapshot
  E/GeckoConsole( 1160): OpenGL compositor Initialized Succesfully.
  E/GeckoConsole( 1160): Version: OpenGL ES 3.0 V@53.0 AU@  (CL@)
  E/GeckoConsole( 1160): Vendor: Qualcomm
  E/GeckoConsole( 1160): Renderer: Adreno (TM) 305
  E/GeckoConsole( 1160): FBO Texture Target: TEXTURE_2D

Notice the first line.  fd 108 is the file descriptor used to communicate
between the user-space driver and the kernel driver for Adreno.  I
interpret this to mean that the information presented to the kernel
driver is somehow incorrect or corrupted.  The crash always happens
very shortly after the first ioctl(IOCTL_KGSL_RINGBUFFER_ISSUEIBCMDS)
is performed.
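
For reference, the request code in that first log line (0xc0140910) can be taken apart by hand.  This is a minimal sketch, assuming the generic Linux _IOC bit layout (nr in bits 0-7, type in bits 8-15, size in bits 16-29, direction in bits 30-31).  decode_ioctl is a hypothetical helper name, not part of any tool mentioned in this bug.

```shell
#!/bin/sh
# Split a Linux ioctl request code into its fields.
# Assumption: the generic _IOC layout applies (as it does on 32-bit ARM).
decode_ioctl() {
    code=$(( $1 ))
    nr=$((   code         & 0xff   ))   # command number
    typ=$((  (code >> 8)  & 0xff   ))   # driver "magic" type
    size=$(( (code >> 16) & 0x3fff ))   # size of the argument struct, in bytes
    dir=$((  (code >> 30) & 0x3    ))   # 1 = write, 2 = read, 3 = both
    printf 'dir=%d size=%d type=0x%02x nr=0x%02x\n' "$dir" "$size" "$typ" "$nr"
}

decode_ioctl 0xc0140910
```

This prints dir=3 size=20 type=0x09 nr=0x10, that is, a read/write ioctl carrying a 20-byte argument structure.  It says nothing about which part of that structure the kernel driver rejects.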

A couple of other reference points:

* this is with the 32-bit ARM port of Valgrind.  This is quite stable
  and has no problem running Gecko, Firefox OS, Fx for Android on
  other targets.

* the exact same source tree builds and runs ok on Nexus 5, both
  natively and on Valgrind.  Even stranger is that Nexus 5 also
  has an Adreno GPU, although I don't know if it is the same model
  as on the Flame.

I am mystified.

After a crash, the 'dmesg' output contains this kind of thing:

<3>[  488.282496] kgsl kgsl-3d0: Proc memcheck-arm-li, ctxt_id 1 ts 2 triggered fault tolerance on global ts 897
<3>[  488.282511] kgsl kgsl-3d0: STATUS C000C001 | IB1:0E4AA3DC/000000BC | IB2: F800A010/00000000 | RPTR: 00C0 | WPTR: 00FE
<3>[  488.282525] kgsl kgsl-3d0: |kgsl_device_snapshot| Failed to get GPU active count
<3>[  488.282561] kgsl kgsl-3d0: |_adreno_ft| MMU fault skipping replay
<3>[  488.319909] kgsl kgsl-3d0: *EMPTY*
<3>[  488.323256] kgsl kgsl-3d0:  <- fault @ 0E4AA460
<3>[  488.327750] kgsl kgsl-3d0: *EMPTY*
<2>[  488.335934] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| GPU PAGE FAULT: addr = E488210 pid = 1160
<2>[  488.343629] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| context = 0 FSR = 4000000A FSYNR0 = F000001 FSYNR1 = 441442(read fault)
<3>[  488.354913] kgsl kgsl-3d0: ---- nearby memory ----
<3>[  488.359656] kgsl kgsl-3d0: *EMPTY*
<3>[  488.363065] kgsl kgsl-3d0:  <- fault @ 0E488210
<3>[  488.367552] kgsl kgsl-3d0: *EMPTY*
<2>[  488.371075] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| GPU PAGE FAULT: addr = E488230 pid = 1160
<2>[  488.379657] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| context = 0 FSR = 4000000A FSYNR0 = E000001 FSYNR1 = 440442(read fault)

which doesn't sound good to me.
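
To compare fault addresses across runs, the page-fault lines can be pulled out of a saved dmesg dump.  The following is only a sketch that assumes the line format shown above.  gpu_page_faults is a hypothetical helper name, and the here-document just replays three of the lines from this comment.

```shell
#!/bin/sh
# Print the faulting address and pid from each kgsl "GPU PAGE FAULT" line on stdin.
# Assumption: lines look like the kgsl_iommu_fault_handler output quoted above.
gpu_page_faults() {
    sed -n 's/.*GPU PAGE FAULT: addr = \([0-9A-Fa-f]*\) pid = \([0-9]*\).*/addr=0x\1 pid=\2/p'
}

gpu_page_faults <<'EOF'
<2>[  488.335934] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| GPU PAGE FAULT: addr = E488210 pid = 1160
<3>[  488.354913] kgsl kgsl-3d0: ---- nearby memory ----
<2>[  488.371075] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| GPU PAGE FAULT: addr = E488230 pid = 1160
EOF
```

On a live device the same function could be fed from "adb shell dmesg", provided the phone survives long enough to read the log.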
root@flame:/data/local # ./busybox-armv7l uname -a
Linux localhost 3.4.0-g30d40a5 #2 SMP PREEMPT Sun Jul 27 19:09:43 CEST 2014 armv7l GNU/Linux

Steps to reproduce (STR) are:

B2G (trunk? whatever you get by default) of 23-07-2014

./config.sh flame

.userconfig as follows
  export B2G_VALGRIND=1
  export NOFTU=1
  export DEVICE_DEBUG=1
  export DISABLE_JEMALLOC=1

Add the following at the end of gonk-misc/default-gecko-config

  ac_add_options --disable-crashreporter
  ac_add_options --disable-profiling
  ac_add_options --enable-optimize="-g -O"

Then

  ./build.sh
  ./flash.sh

The phone starts up fine.  Then

  ./run-valgrind.sh

This takes several minutes to push the debuginfo libxul.so to the phone,
and a couple more minutes for Valgrind to get Gecko most of the
way through its startup.  But at some point it hangs, and
all 'adb shell' connections to the phone drop (that is the
easy way to see that it has crashed).
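
The dropped adb connection can also be watched for from the host while Valgrind is starting Gecko.  This is a rough sketch, assuming adb is on the PATH and that "adb get-state" prints "device" while the phone is reachable.  wait_for_adb_drop and the ADB override variable are hypothetical names introduced only for illustration.

```shell
#!/bin/sh
# Poll the phone until adb stops reporting it as "device", then print how
# long it stayed reachable.  Set ADB to use a different adb binary.
wait_for_adb_drop() {
    interval=${1:-5}                     # seconds between polls
    start=$(date +%s)
    while [ "$(${ADB:-adb} get-state 2>/dev/null)" = "device" ]; do
        sleep "$interval"
    done
    echo "device unreachable after $(( $(date +%s) - start )) seconds"
}

# Usage (in a second terminal, after starting ./run-valgrind.sh):
#   wait_for_adb_drop 5
```

A dropped connection is only a hint that the phone has crashed; an unplugged cable would look the same.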
This blocks marifuzz from being run on Flames with Valgrind builds (and others using V builds too).

Setting needinfo? from Gecko B2G module owners Vivien and Fabrice as a start. How should we move this forward?
Blocks: marifuzz
Flags: needinfo?(fabrice)
Flags: needinfo?(21)
(In reply to Julian Seward [:jseward] from comment #3)

Thanks, I have a build ready; I'm investigating now!
Assignee: nobody → lissyx+mozillians
(In reply to Alexandre LISSY :gerard-majax from comment #5)
> Thanks, I have a build ready; I'm investigating now!

Excellent.  BTW, run-valgrind.sh is a slow way to investigate.
Once you have run it once, to push the debuginfo libxul.so
onto the device, you can re-run much more quickly like this:

adb shell stop b2g ; adb shell "B2G_DIR='/data/valgrind-b2g' \
   HOSTNAME='b2g' LOGNAME='b2g' COMMAND_PREFIX='/system/bin/valgrind \
   -v \
   --tool=none --vex-iropt-register-updates=allregs-at-mem-access \
   --trace-children=yes' exec /system/bin/b2g.sh"

This is faster because (1) you don't have to re-push libxul.so to
the device and (2) because --tool=none does no instrumentation,
which makes Valgrind faster.
(In reply to Julian Seward [:jseward] from comment #6)

Thanks!
Attached file GPU PAGE FAULT
Kernel dmesg output when the crash happens.  With the config flag MSM_KGSL_MMU_PAGE_FAULT ("Force the GPU MMU to page fault for unmapped regions") set to n, the device still crashes, but the GPU page fault no longer shows up in dmesg.
Okay, so it may be related to memory management somewhere.  For now, by hacking the kernel, I have been able to boot and at least get the status bar and background displayed, just by disabling this in Kconfig: CONFIG_KGSL_PER_PROCESS_PAGE_TABLE.

It seems to be booting, but given it's running under Valgrind, it's very slow, of course.
Flags: needinfo?(jseward)
From Kconfig:
 - KGSL_PER_PROCESS_PAGE_TABLE: Enable Per Process page tables for the KGSL driver
   The MMU will use per process pagetables when enabled.

A bit further down, we can learn that:
 - the default pagetable size used by the MMU is 0xFFF0000, which is 256M - 64K (MSM_KGSL_PAGE_TABLE_SIZE)
 - the minimum number of concurrent pagetables to support is 8 by default (MSM_KGSL_PAGE_TABLE_COUNT)
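
As a quick sanity check on the size quoted above, the arithmetic can be reproduced in the shell.  This is just an illustration of the 256M - 64K figure; the variable names are made up here.

```shell
#!/bin/sh
# Confirm that 0xFFF0000 equals 256 MiB minus 64 KiB.
default_size=$(( 0xFFF0000 ))
expected=$(( 256 * 1024 * 1024 - 64 * 1024 ))
printf 'default=0x%X expected=0x%X\n' "$default_size" "$expected"
```

Both values print as 0xFFF0000, so the Kconfig default matches its description.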
Attached file dmesg.valgrind4.txt
dmesg output under valgrind after setting CONFIG_KGSL_PER_PROCESS_PAGE_TABLE=n
Setting either
 - CONFIG_MSM_KGSL_PAGE_TABLE_COUNT=1
or
 - CONFIG_MSM_KGSL_PAGE_TABLE_COUNT=2 and # CONFIG_MSM_KGSL_PAGE_TABLE_SIZE=0x00FF0000

does not help: it still crashes badly.
(clearing needinfo? from Fabrice/Vivien, seems like this bug has traction moving forward.)
Flags: needinfo?(fabrice)
Flags: needinfo?(21)
(In reply to Alexandre LISSY :gerard-majax from comment #9)
> CONFIG_KGSL_PER_PROCESS_PAGE_TABLE=n

This also works for me.  With that in place, it starts up and
runs stably, and stayed alive overnight.  There is a home screen
with icons and I was able to start and quit the camera app 
from that.

For Valgrinding purposes, this seems like a good-enough workaround.
I am not sure what the followup actions should be, though, regarding
whether this change should be pushed into the tree.

Alexandre, thank you for looking into this.
Flags: needinfo?(jseward)
Unassigning myself from this. The next step is probably for someone to get in touch with Qualcomm (qc) to find out why it fails.
Assignee: lissyx+mozillians → nobody
Firefox OS is no longer being worked on.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX