Closed
Bug 1049866
Opened 11 years ago
Closed 7 years ago
FxOS trunk on Valgrind hard-crashes Flame
Categories
(Firefox OS Graveyard :: Stability, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: jseward, Unassigned)
Attachments
(2 files)
FxOS trunk builds and runs fine on Flame. However, when the same
build is run on Valgrind on the Flame, the device crashes hard and
can only be recovered by power cycling. Despite considerable
investigation into why this happens, I do not have much information
to go on.
Investigation shows that the crash is GPU related. It happens late in
startup, in CompositorOGL::Initialize(), in
gecko/gfx/layers/opengl/CompositorOGL.cpp. My impression, from
studying Valgrind internal debug logs, is that Gecko uses the
GPU-specific (Adreno) driver to compile a shader, and hands the
generated code to the kernel driver so as to get it to run on the GPU.
The phone crashes at that point.
Logcat looks like this:
I/Gecko ( 1160): Attempting load of libEGL.so
D/libEGL ( 1160): loaded /vendor/lib/egl/libEGL_adreno.so
D/libEGL ( 1160): loaded /vendor/lib/egl/libGLESv1_CM_adreno.so
D/libEGL ( 1160): loaded /vendor/lib/egl/libGLESv2_adreno.so
I/Adreno-EGL( 1160): <qeglDrvAPI_eglInitialize:381>: EGL 1.4 QUALCOMM build: ()
I/Adreno-EGL( 1160): OpenGL ES Shader Compiler Version: 20.00.08
I/Adreno-EGL( 1160): Build Date: 03/15/14 Sat
I/Adreno-EGL( 1160): Local Branch:
I/Adreno-EGL( 1160): Remote Branch:
I/Adreno-EGL( 1160): Local Patches:
I/Adreno-EGL( 1160): Reconstruct Branch:
I/HWComposer( 1160): Creating new instance
The log ends at this point.
Very occasionally (once every 20 runs?), the phone does not crash.
Instead the logcat continues like this:
W/Adreno-GSL( 1160): <gsl_ldd_control:397>: ioctl fd 108 code 0xc0140910 (IOCTL_KGSL_RINGBUFFER_ISSUEIBCMDS) failed: errno 71 Protocol error
W/Adreno-GSL( 1160): <log_gpu_snapshot:312>: panel.gpuSnapshotPath is not set.not generating user snapshot
E/GeckoConsole( 1160): OpenGL compositor Initialized Succesfully.
E/GeckoConsole( 1160): Version: OpenGL ES 3.0 V@53.0 AU@ (CL@)
E/GeckoConsole( 1160): Vendor: Qualcomm
E/GeckoConsole( 1160): Renderer: Adreno (TM) 305
E/GeckoConsole( 1160): FBO Texture Target: TEXTURE_2D
Notice the first line. fd 108 is the file descriptor used to communicate
between the user-space driver and the kernel driver for Adreno. What I
interpret this to say is that the information presented to the kernel
driver is somehow incorrect or corrupted. The crash always happens
very shortly after the first ioctl(IOCTL_KGSL_RINGBUFFER_ISSUEIBCMDS)
is performed.
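As a cross-check on the ioctl code in that log line, the number can be unpacked using the generic Linux _IOC bit layout (direction in bits 30-31, argument size in bits 16-29, type in bits 8-15, command number in bits 0-7). This is only a sketch: the function name decode_ioctl is made up for illustration, and it assumes the Flame's ARM kernel uses the generic layout.

```python
def decode_ioctl(code):
    """Split a Linux ioctl request number into its _IOC fields (generic layout)."""
    return {
        "dir": (code >> 30) & 0x3,      # 1 = write, 2 = read, 3 = read and write
        "size": (code >> 16) & 0x3FFF,  # size of the argument struct in bytes
        "type": (code >> 8) & 0xFF,     # per-driver "magic" number
        "nr": code & 0xFF,              # command number within that driver
    }


# The code reported by Adreno-GSL in the logcat above.
fields = decode_ioctl(0xC0140910)
print(fields)
```

Under that assumption, 0xc0140910 is a read/write ioctl with a 20-byte argument, type 0x09 and command number 0x10. That is at least consistent with it being a well-formed request; it says nothing about whether the contents of the argument are valid.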
A couple of other reference points:
* this is with the 32-bit ARM port of Valgrind. This is quite stable
and has no problem running Gecko, Firefox OS, Fx for Android on
other targets.
* the exact same source tree builds and runs ok on Nexus 5, both
natively and on Valgrind. Even stranger is that Nexus 5 also
has an Adreno GPU, although I don't know if it is the same model
as on the Flame.
I am mystified.
Reporter
Comment 1•11 years ago
After a crash, the 'dmesg' output contains this kind of thing:
<3>[ 488.282496] kgsl kgsl-3d0: Proc memcheck-arm-li, ctxt_id 1 ts 2 triggered fault tolerance on global ts 897
<3>[ 488.282511] kgsl kgsl-3d0: STATUS C000C001 | IB1:0E4AA3DC/000000BC | IB2: F800A010/00000000 | RPTR: 00C0 | WPTR: 00FE
<3>[ 488.282525] kgsl kgsl-3d0: |kgsl_device_snapshot| Failed to get GPU active count
<3>[ 488.282561] kgsl kgsl-3d0: |_adreno_ft| MMU fault skipping replay
<3>[ 488.319909] kgsl kgsl-3d0: *EMPTY*
<3>[ 488.323256] kgsl kgsl-3d0: <- fault @ 0E4AA460
<3>[ 488.327750] kgsl kgsl-3d0: *EMPTY*
<2>[ 488.335934] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| GPU PAGE FAULT: addr = E488210 pid = 1160
<2>[ 488.343629] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| context = 0 FSR = 4000000A FSYNR0 = F000001 FSYNR1 = 441442(read fault)
<3>[ 488.354913] kgsl kgsl-3d0: ---- nearby memory ----
<3>[ 488.359656] kgsl kgsl-3d0: *EMPTY*
<3>[ 488.363065] kgsl kgsl-3d0: <- fault @ 0E488210
<3>[ 488.367552] kgsl kgsl-3d0: *EMPTY*
<2>[ 488.371075] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| GPU PAGE FAULT: addr = E488230 pid = 1160
<2>[ 488.379657] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| context = 0 FSR = 4000000A FSYNR0 = E000001 FSYNR1 = 440442(read fault)
which doesn't sound good to me.
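To compare fault addresses across runs, the page fault lines can be pulled out of a saved dmesg dump with a small script. A minimal sketch, assuming the line format stays exactly as shown above; the regular expression and the function name gpu_page_faults are mine, not part of any kgsl tooling.

```python
import re

# Matches the kgsl_iommu_fault_handler lines in the dmesg output above.
FAULT_RE = re.compile(r"GPU PAGE FAULT: addr = ([0-9A-Fa-f]+) pid = (\d+)")


def gpu_page_faults(dmesg_text):
    """Return (address, pid) tuples for each GPU page fault line, in order."""
    return [(int(m.group(1), 16), int(m.group(2)))
            for m in FAULT_RE.finditer(dmesg_text)]


# Two lines copied from the dmesg output in this comment.
sample = """\
<2>[ 488.335934] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| GPU PAGE FAULT: addr = E488210 pid = 1160
<2>[ 488.371075] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| GPU PAGE FAULT: addr = E488230 pid = 1160
"""

faults = gpu_page_faults(sample)
for addr, pid in faults:
    print(f"pid {pid}: fault at 0x{addr:08X}")
```

Both faults come from pid 1160, the same pid that appears in the logcat lines in the description.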
Reporter
Comment 2•11 years ago
root@flame:/data/local # ./busybox-armv7l uname -a
Linux localhost 3.4.0-g30d40a5 #2 SMP PREEMPT Sun Jul 27 19:09:43 CEST 2014 armv7l GNU/Linux
Reporter
Comment 3•11 years ago
STR are:
B2G (trunk? whatever you get by default) of 23-07-2014
./config.sh flame
.userconfig as follows
export B2G_VALGRIND=1
export NOFTU=1
export DEVICE_DEBUG=1
export DISABLE_JEMALLOC=1
Add the following at the end of gonk-misc/default-gecko-config
ac_add_options --disable-crashreporter
ac_add_options --disable-profiling
ac_add_options --enable-optimize="-g -O"
Then
./build.sh
./flash.sh
phone starts up fine. Then
./run-valgrind.sh
takes several minutes to push debuginfo libxul.so to the phone
and a couple more minutes for Valgrind to get Gecko most of the
way through its startup. But at some point it hangs, and
all 'adb shell' connections to the phone drop (that is the
easy way to see that it has crashed.)
Comment 4•11 years ago
This blocks marifuzz from being run on Flames with Valgrind builds (and others using V builds too).
Setting needinfo? from Gecko B2G module owners Vivien and Fabrice as a start. How should we move this forward?
Comment 5•11 years ago
(In reply to Julian Seward [:jseward] from comment #3)
> STR are:
>
> B2G (trunk? whatever you get by default) of 23-07-2014
>
> ./config.sh flame
>
> .userconfig as follows
> export B2G_VALGRIND=1
> export NOFTU=1
> export DEVICE_DEBUG=1
> export DISABLE_JEMALLOC=1
>
> Add the following at the end of gonk-misc/default-gecko-config
>
> ac_add_options --disable-crashreporter
> ac_add_options --disable-profiling
> ac_add_options --enable-optimize="-g -O"
>
> Then
>
> ./build.sh
> ./flash.sh
>
> phone starts up fine. Then
>
> ./run-valgrind.sh
>
> takes several minutes to push debuginfo libxul.so to the phone
> and a couple more minutes for Valgrind to get Gecko most of the
> way through its startup. But at some point it hangs, and
> all 'adb shell' connections to the phone drop (that is the
> easy way to see that it has crashed.)
Thanks, I have a build ready, I'm investigating now!
Assignee: nobody → lissyx+mozillians
Reporter
Comment 6•11 years ago
(In reply to Alexandre LISSY :gerard-majax from comment #5)
> Thanks, I have a build ready, I'm investigating now !
Excellent. BTW, run-valgrind.sh is a slow way to investigate.
Once you have run it once, to push the debuginfo libxul.so
onto the device, you can re-run much more quickly like this
adb shell stop b2g ; adb shell "B2G_DIR='/data/valgrind-b2g' \
HOSTNAME='b2g' LOGNAME='b2g' COMMAND_PREFIX='/system/bin/valgrind \
-v \
--tool=none --vex-iropt-register-updates=allregs-at-mem-access \
--trace-children=yes' exec /system/bin/b2g.sh"
This is faster because (1) you don't have to re-push libxul.so to
the device and (2) because --tool=none does no instrumentation,
which makes Valgrind faster.
Comment 7•11 years ago
(In reply to Julian Seward [:jseward] from comment #6)
> (In reply to Alexandre LISSY :gerard-majax from comment #5)
> > Thanks, I have a build ready, I'm investigating now !
>
> Excellent. BTW, run-valgrind.sh is a slow way to investigate.
> Once you have run it once, to push the debuginfo libxul.so
> onto the device, you can re-run much more quickly like this
>
> adb shell stop b2g ; adb shell "B2G_DIR='/data/valgrind-b2g' \
> HOSTNAME='b2g' LOGNAME='b2g' COMMAND_PREFIX='/system/bin/valgrind \
> -v \
> --tool=none --vex-iropt-register-updates=allregs-at-mem-access \
> --trace-children=yes' exec /system/bin/b2g.sh"
>
> This is faster because (1) you don't have to re-push libxul.so to
> the device and (2) because --tool=none does no instrumentation,
> which makes Valgrind faster.
Thanks!
Comment 8•11 years ago
Kernel dmesg output when the crash happens. With the config flag MSM_KGSL_MMU_PAGE_FAULT ("Force the GPU MMU to page fault for unmapped regions") set to n, the device still crashes, but the GPU page fault no longer appears in dmesg.
Comment 9•11 years ago
Okay, so it may be related to memory management somewhere. For now, by hacking the kernel, I have been able to boot and at least get the status bar and background displayed, just by disabling CONFIG_KGSL_PER_PROCESS_PAGE_TABLE in the kernel config.
It seems to be booting, but given it's running under Valgrind, it's very slow, of course.
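To confirm which kernel a given phone is actually running, the option can be looked up in a copy of the kernel's .config. A minimal sketch, assuming the usual .config text format ("CONFIG_X=value" for set options, "# CONFIG_X is not set" for disabled ones); the helper name kconfig_value and the sample text are made up for illustration and are not taken from the Flame kernel.

```python
def kconfig_value(config_text, symbol):
    """Return the value of CONFIG_<symbol> from kernel .config text, or None if unset."""
    key = "CONFIG_" + symbol
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(key + "="):
            return line.split("=", 1)[1]
    # Covers both "# CONFIG_X is not set" and a symbol that is absent.
    return None


# Hypothetical excerpt of a .config with the workaround applied.
sample = """\
# CONFIG_KGSL_PER_PROCESS_PAGE_TABLE is not set
CONFIG_MSM_KGSL_PAGE_TABLE_COUNT=8
"""

print(kconfig_value(sample, "KGSL_PER_PROCESS_PAGE_TABLE"))
print(kconfig_value(sample, "MSM_KGSL_PAGE_TABLE_COUNT"))
```

A result of None for KGSL_PER_PROCESS_PAGE_TABLE would mean the workaround is in place in that build.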
Flags: needinfo?(jseward)
Comment 10•11 years ago
From Kconfig:
- KGSL_PER_PROCESS_PAGE_TABLE: Enable Per Process page tables for the KGSL driver
The MMU will use per process pagetables when enabled.
A bit below, we can learn that:
- the default pagetable size used by the MMU is 0xFFF0000, which is 256M - 64k (MSM_KGSL_PAGE_TABLE_SIZE)
- the minimum number of concurrent pagetables to support is 8 by default (MSM_KGSL_PAGE_TABLE_COUNT)
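The size arithmetic is easy to check. A small sketch (the helper name as_mib_minus_kib is made up for illustration) that expresses a pagetable size as "N MiB minus K KiB":

```python
KIB = 1024
MIB = 1024 * KIB


def as_mib_minus_kib(size):
    """Express size as (mib, kib), meaning 'mib MiB minus kib KiB'."""
    mib = -(-size // MIB)             # round up to a whole number of MiB
    kib = (mib * MIB - size) // KIB   # how far short of that the size falls
    return mib, kib


default_size = 0xFFF0000  # MSM_KGSL_PAGE_TABLE_SIZE default quoted above
print(as_mib_minus_kib(default_size))
```

For 0xFFF0000 this gives 256 MiB minus 64 KiB, matching the figure above.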
Comment 11•11 years ago
dmesg output under valgrind after setting CONFIG_KGSL_PER_PROCESS_PAGE_TABLE=n
Comment 12•11 years ago
With either of these settings:
- CONFIG_MSM_KGSL_PAGE_TABLE_COUNT=1
or
- CONFIG_MSM_KGSL_PAGE_TABLE_COUNT=2 and # CONFIG_MSM_KGSL_PAGE_TABLE_SIZE=0x00FF0000
it still crashes badly.
Comment 13•11 years ago
(clearing needinfo? from Fabrice/Vivien, seems like this bug has traction moving forward.)
Flags: needinfo?(fabrice)
Flags: needinfo?(21)
Reporter
Comment 14•11 years ago
(In reply to Alexandre LISSY :gerard-majax from comment #9)
> CONFIG_KGSL_PER_PROCESS_PAGE_TABLE=n
This also works for me. With that in place, it starts up and
runs stably, and stayed alive overnight. There is a home screen
with icons and I was able to start and quit the camera app
from that.
For Valgrinding purposes, this seems like a good-enough workaround.
I am not sure what the followup actions should be, though, regarding
whether this change should be pushed into the tree.
Alexandre, thank you for looking into this.
Flags: needinfo?(jseward)
Comment 15•11 years ago
Unassigning myself from this. The next step is probably for someone to get in touch with qc to find out why it fails.
Assignee: lissyx+mozillians → nobody
Comment 16•7 years ago
Firefox OS is not being worked on
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX