1049866 - FxOS trunk on Valgrind hard-crashes Flame

Reporter

Description

11 years ago

FxOS trunk builds and runs fine on Flame. However, when the same build is run on Valgrind on the Flame, the device crashes hard and can only be recovered by power cycling. Despite considerable investigation, I do not have much information to go on.

Investigation shows that the crash is GPU related. It happens late in startup, in CompositorOGL::Initialize(), in gfx/gecko/layers/opengl/CompositorOGL.cpp. My impression, from studying Valgrind internal debug logs, is that Gecko uses the GPU-specific (Adreno) driver to compile a shader, and hands the generated code to the kernel driver so as to get it to run on the GPU. The phone crashes at that point. Logcat looks like this:

I/Gecko ( 1160): Attempting load of libEGL.so
D/libEGL ( 1160): loaded /vendor/lib/egl/libEGL_adreno.so
D/libEGL ( 1160): loaded /vendor/lib/egl/libGLESv1_CM_adreno.so
D/libEGL ( 1160): loaded /vendor/lib/egl/libGLESv2_adreno.so
I/Adreno-EGL( 1160): <qeglDrvAPI_eglInitialize:381>: EGL 1.4 QUALCOMM build: ()
I/Adreno-EGL( 1160): OpenGL ES Shader Compiler Version: 20.00.08
I/Adreno-EGL( 1160): Build Date: 03/15/14 Sat
I/Adreno-EGL( 1160): Local Branch:
I/Adreno-EGL( 1160): Remote Branch:
I/Adreno-EGL( 1160): Local Patches:
I/Adreno-EGL( 1160): Reconstruct Branch:
I/HWComposer( 1160): Creating new instance

It finishes at this point. Very occasionally (once every 20 runs?), the phone does not crash. Instead the logcat continues like this:

W/Adreno-GSL( 1160): <gsl_ldd_control:397>: ioctl fd 108 code 0xc0140910 (IOCTL_KGSL_RINGBUFFER_ISSUEIBCMDS) failed: errno 71 Protocol error
W/Adreno-GSL( 1160): <log_gpu_snapshot:312>: panel.gpuSnapshotPath is not set.not generating user snapshot
E/GeckoConsole( 1160): OpenGL compositor Initialized Succesfully.
E/GeckoConsole( 1160): Version: OpenGL ES 3.0 V@53.0 AU@ (CL@)
E/GeckoConsole( 1160): Vendor: Qualcomm
E/GeckoConsole( 1160): Renderer: Adreno (TM) 305
E/GeckoConsole( 1160): FBO Texture Target: TEXTURE_2D

Notice the first line. fd 108 is the file descriptor used to communicate between the user-space driver and the kernel driver for Adreno. My interpretation is that the information presented to the kernel driver is somehow incorrect or corrupted. The crash always happens very shortly after the first ioctl(IOCTL_KGSL_RINGBUFFER_ISSUEIBCMDS) is performed.

A couple of other reference points:

* this is with the 32-bit ARM port of Valgrind. This is quite stable and has no problem running Gecko, Firefox OS, Fx for Android on other targets.

* the exact same source tree builds and runs ok on Nexus 5, both natively and on Valgrind. Even stranger is that Nexus 5 also has an Adreno GPU, although I don't know if it is the same model as on the Flame.

I am mystified.
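
For anyone who wants to sanity-check the failing request number, here is a minimal sketch that splits the code 0xc0140910 from the logcat line into its fields. It assumes the generic Linux _IOC bit layout (nr in bits 0-7, type in bits 8-15, size in bits 16-29, direction in bits 30-31); the function name decode_ioctl is made up for this illustration and is not part of any driver.

```python
# Assumption: generic Linux _IOC encoding, as used on 32-bit ARM.
_IOC_NRBITS, _IOC_TYPEBITS, _IOC_SIZEBITS = 8, 8, 14
_IOC_DIRS = {0: "none", 1: "write", 2: "read", 3: "read|write"}


def decode_ioctl(code):
    """Split an ioctl request number into its four _IOC fields."""
    nr = code & ((1 << _IOC_NRBITS) - 1)
    ioc_type = (code >> _IOC_NRBITS) & ((1 << _IOC_TYPEBITS) - 1)
    size = (code >> (_IOC_NRBITS + _IOC_TYPEBITS)) & ((1 << _IOC_SIZEBITS) - 1)
    direction = (code >> (_IOC_NRBITS + _IOC_TYPEBITS + _IOC_SIZEBITS)) & 0x3
    return {
        "dir": _IOC_DIRS[direction],
        "type": ioc_type,
        "nr": nr,
        "size": size,
    }


# The request number reported by Adreno-GSL in the logcat above.
fields = decode_ioctl(0xC0140910)
print(fields)
```

Under that assumption the request is a read/write ioctl with type byte 0x09, command number 0x10 and a 20-byte argument struct, so the request number itself looks well formed; that would point at the contents of the argument (or the memory it refers to) rather than the call number.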

Julian Seward [:jseward]

Reporter

Comment 1

11 years ago

After a crash, the 'dmesg' output contains this kind of thing:

<3>[ 488.282496] kgsl kgsl-3d0: Proc memcheck-arm-li, ctxt_id 1 ts 2 triggered fault tolerance on global ts 897
<3>[ 488.282511] kgsl kgsl-3d0: STATUS C000C001 | IB1:0E4AA3DC/000000BC | IB2: F800A010/00000000 | RPTR: 00C0 | WPTR: 00FE
<3>[ 488.282525] kgsl kgsl-3d0: |kgsl_device_snapshot| Failed to get GPU active count
<3>[ 488.282561] kgsl kgsl-3d0: |_adreno_ft| MMU fault skipping replay
<3>[ 488.319909] kgsl kgsl-3d0: *EMPTY*
<3>[ 488.323256] kgsl kgsl-3d0: <- fault @ 0E4AA460
<3>[ 488.327750] kgsl kgsl-3d0: *EMPTY*
<2>[ 488.335934] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| GPU PAGE FAULT: addr = E488210 pid = 1160
<2>[ 488.343629] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| context = 0 FSR = 4000000A FSYNR0 = F000001 FSYNR1 = 441442(read fault)
<3>[ 488.354913] kgsl kgsl-3d0: ---- nearby memory ----
<3>[ 488.359656] kgsl kgsl-3d0: *EMPTY*
<3>[ 488.363065] kgsl kgsl-3d0: <- fault @ 0E488210
<3>[ 488.367552] kgsl kgsl-3d0: *EMPTY*
<2>[ 488.371075] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| GPU PAGE FAULT: addr = E488230 pid = 1160
<2>[ 488.379657] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| context = 0 FSR = 4000000A FSYNR0 = E000001 FSYNR1 = 440442(read fault)

which doesn't sound good to me.
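
When comparing several crashes it can help to pull just the faulting addresses out of dmesg. The following is a rough sketch that does so for lines in the format shown above; the regular expression, the function name gpu_page_faults and the 4 KiB page size are assumptions made for this illustration, not something taken from the kgsl driver.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages

# Matches the kgsl_iommu_fault_handler lines as they appear in this dmesg.
FAULT_RE = re.compile(
    r"GPU PAGE FAULT: addr = (?P<addr>[0-9A-Fa-f]+) pid = (?P<pid>\d+)"
)


def gpu_page_faults(dmesg_text):
    """Return a list of (address, pid) tuples, one per GPU page fault line."""
    return [
        (int(m.group("addr"), 16), int(m.group("pid")))
        for m in FAULT_RE.finditer(dmesg_text)
    ]


# The two fault lines copied from the dmesg output in this comment.
SAMPLE = """\
<2>[ 488.335934] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| GPU PAGE FAULT: addr = E488210 pid = 1160
<2>[ 488.371075] kgsl kgsl-3d0: |kgsl_iommu_fault_handler| GPU PAGE FAULT: addr = E488230 pid = 1160
"""

faults = gpu_page_faults(SAMPLE)
# Round each address down to the start of its page.
pages = {addr - (addr % PAGE_SIZE) for addr, _pid in faults}
for addr, pid in faults:
    print(f"pid {pid}: fault at {addr:#010x}")
print("distinct pages:", sorted(hex(p) for p in pages))
```

On the two lines above, both faults belong to pid 1160 (the same pid as in the logcat) and, with 4 KiB pages, fall in the same page, 0x20 bytes apart.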

Julian Seward [:jseward]

Reporter

Comment 2

11 years ago

root@flame:/data/local # ./busybox-armv7l uname -a
Linux localhost 3.4.0-g30d40a5 #2 SMP PREEMPT Sun Jul 27 19:09:43 CEST 2014 armv7l GNU/Linux

Julian Seward [:jseward]

Reporter

Comment 3

11 years ago

STR are:

B2G (trunk? whatever you get by default) of 23-07-2014

./config.sh flame

.userconfig as follows
export B2G_VALGRIND=1
export NOFTU=1
export DEVICE_DEBUG=1
export DISABLE_JEMALLOC=1

Add the following at the end of gonk-misc/default-gecko-config

ac_add_options --disable-crashreporter
ac_add_options --disable-profiling
ac_add_options --enable-optimize="-g -O"

Then

./build.sh
./flash.sh

phone starts up fine. Then

./run-valgrind.sh

takes several minutes to push debuginfo libxul.so to the phone
and a couple more minutes for Valgrind to get Gecko most of the
way through its startup. But at some point it hangs, and
all 'adb shell' connections to the phone drop (that is the
easy way to see that it has crashed.)

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 4

11 years ago

This blocks marifuzz from being run on Flames with Valgrind builds (and others using V builds too). Setting needinfo? from Gecko B2G module owners Vivien and Fabrice as a start. How should we move this forward?

Blocks: marifuzz

Flags: needinfo?(fabrice)

Flags: needinfo?(21)

:gerard-majax

Comment 5

11 years ago

(In reply to Julian Seward [:jseward] from comment #3)
> STR are:
>
> B2G (trunk? whatever you get by default) of 23-07-2014
>
> ./config.sh flame
>
> .userconfig as follows
> export B2G_VALGRIND=1
> export NOFTU=1
> export DEVICE_DEBUG=1
> export DISABLE_JEMALLOC=1
>
> Add the following at the end of gonk-misc/default-gecko-config
>
> ac_add_options --disable-crashreporter
> ac_add_options --disable-profiling
> ac_add_options --enable-optimize="-g -O"
>
> Then
>
> ./build.sh
> ./flash.sh
>
> phone starts up fine. Then
>
> ./run-valgrind.sh
>
> takes several minutes to push debuginfo libxul.so to the phone
> and a couple more minutes for Valgrind to get Gecko most of the
> way through its startup. But at some point it hangs, and
> all 'adb shell' connections to the phone drop (that is the
> easy way to see that it has crashed.)

Thanks, I have a build ready, I'm investigating now!

Assignee: nobody → lissyx+mozillians

Julian Seward [:jseward]

Reporter

Comment 6

11 years ago

(In reply to Alexandre LISSY :gerard-majax from comment #5)
> Thanks, I have a build ready, I'm investigating now !

Excellent. BTW, run-valgrind.sh is a slow way to investigate.
Once you have run it once, to push the debuginfo libxul.so
onto the device, you can re-run much more quickly like this

adb shell stop b2g ; adb shell "B2G_DIR='/data/valgrind-b2g' \
HOSTNAME='b2g' LOGNAME='b2g' COMMAND_PREFIX='/system/bin/valgrind \
-v \
--tool=none --vex-iropt-register-updates=allregs-at-mem-access \
--trace-children=yes' exec /system/bin/b2g.sh"

This is faster because (1) you don't have to re-push libxul.so to
the device and (2) because --tool=none does no instrumentation,
which makes Valgrind faster.

:gerard-majax

Comment 7

11 years ago

(In reply to Julian Seward [:jseward] from comment #6)
> (In reply to Alexandre LISSY :gerard-majax from comment #5)
> > Thanks, I have a build ready, I'm investigating now !
>
> Excellent. BTW, run-valgrind.sh is a slow way to investigate.
> Once you have run it once, to push the debuginfo libxul.so
> onto the device, you can re-run much more quickly like this
>
> adb shell stop b2g ; adb shell "B2G_DIR='/data/valgrind-b2g' \
> HOSTNAME='b2g' LOGNAME='b2g' COMMAND_PREFIX='/system/bin/valgrind \
> -v \
> --tool=none --vex-iropt-register-updates=allregs-at-mem-access \
> --trace-children=yes' exec /system/bin/b2g.sh"
>
> This is faster because (1) you don't have to re-push libxul.so to
> the device and (2) because --tool=none does no instrumentation,
> which makes Valgrind faster.

Thanks!

:gerard-majax

Comment 8

11 years ago

Attached file: GPU PAGE FAULT

Kernel dmesg output when the crash happens. Changing the config flag MSM_KGSL_MMU_PAGE_FAULT ("Force the GPU MMU to page fault for unmapped regions") to n does not help: the device still crashes, but the GPU page fault no longer shows up in dmesg.

:gerard-majax

Comment 9

11 years ago

Okay, so it may be related to memory management somewhere. For now, by hacking the kernel, I have been able to boot and at least get the statusbar and background displayed, just by disabling this in Kconfig: CONFIG_KGSL_PER_PROCESS_PAGE_TABLE. It seems to be booting, but given that it's running under Valgrind, it's very slow, of course.

Flags: needinfo?(jseward)

:gerard-majax

Comment 10

11 years ago

From Kconfig:
- KGSL_PER_PROCESS_PAGE_TABLE: Enable Per Process page tables for the KGSL driver
  The MMU will use per process pagetables when enabled.

A bit below, we can learn that:
- the default pagetable size used by the MMU is 0xFFF0000, which is 256M - 64k (MSM_KGSL_PAGE_TABLE_SIZE)
- the minimum number of concurrent pagetables to support is set to 8 by default (MSM_KGSL_PAGE_TABLE_COUNT)
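
As a quick check of the "256M - 64k" arithmetic, here is a small sketch that expresses a pagetable size as a whole number of MiB minus a shortfall in KiB. The helper name mib_minus_kib is hypothetical and only exists for this illustration.

```python
KIB = 1024
MIB = 1024 * KIB


def mib_minus_kib(size):
    """Express size as (whole MiB rounded up, KiB missing from that total)."""
    mib = -(-size // MIB)  # ceiling division
    shortfall_kib = (mib * MIB - size) // KIB
    return mib, shortfall_kib


# Default value of MSM_KGSL_PAGE_TABLE_SIZE quoted from Kconfig above.
DEFAULT_PAGE_TABLE_SIZE = 0xFFF0000

mib, shortfall = mib_minus_kib(DEFAULT_PAGE_TABLE_SIZE)
print(f"{DEFAULT_PAGE_TABLE_SIZE:#x} = {mib} MiB - {shortfall} KiB")
```

This confirms that 0xFFF0000 is 0x10000000 (256 MiB) minus 0x10000 (64 KiB).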

:gerard-majax

Comment 11

11 years ago

Attached file: dmesg.valgrind4.txt

dmesg output under valgrind after setting CONFIG_KGSL_PER_PROCESS_PAGE_TABLE=n

:gerard-majax

Comment 12

11 years ago

Setting
- CONFIG_MSM_KGSL_PAGE_TABLE_COUNT=1 or
- CONFIG_MSM_KGSL_PAGE_TABLE_COUNT=2
and
# CONFIG_MSM_KGSL_PAGE_TABLE_SIZE=0x00FF0000

It still crashes badly.

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 13

11 years ago

(clearing needinfo? from Fabrice/Vivien, seems like this bug has traction moving forward.)

Flags: needinfo?(fabrice)

Flags: needinfo?(21)

Julian Seward [:jseward]

Reporter

Comment 14

11 years ago

(In reply to Alexandre LISSY :gerard-majax from comment #9)
> CONFIG_KGSL_PER_PROCESS_PAGE_TABLE=n

This also works for me. With that in place, it starts up and runs stably, and stayed alive overnight. There is a home screen with icons and I was able to start and quit the camera app from that.

For Valgrinding purposes, this seems like a good-enough workaround. I am not sure what the followup actions should be, though, regarding whether this change should be pushed into the tree.

Alexandre, thank you for looking into this.

Flags: needinfo?(jseward)

:gerard-majax

Comment 15

11 years ago

Unassigning myself from this. The next step is probably for someone to get in touch with Qualcomm (qc) to find out why it fails.

Assignee: lissyx+mozillians → nobody

BMO Automation

Comment 16

7 years ago

Firefox OS is not being worked on

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → WONTFIX

Categories

(Firefox OS Graveyard :: Stability, defect)

Tracking

(Not tracked)

People

(Reporter: jseward, Unassigned)

Security

(public)

Attachments

(2 files)

GPU PAGE FAULT, 11 years ago, :gerard-majax, 431 bytes, text/plain
dmesg.valgrind4.txt, 11 years ago, :gerard-majax, 370.69 KB, text/plain