The crash happens eventually on pretty much any nontrivial page/application. The crashes almost always happen during or after submitting a GL job. b2g usually crashes in the GLSL compiler or JS engine. For a variety of reasons, the GL impl is highly suspect. I incidentally discovered that disabling the use-mHwc-to-swap-buffers path in |GonkDisplayICS::SwapBuffers()| makes the crashes less frequent. That suggests to me that something in that path is triggering more of the at-fault code. It would be so nice if we could make ICS b2g builds with ASan enabled ...
3 years ago
I can reproduce this reliably (3/3 so far) with

1. Enable GL debugging and layerscope
2. Add a glFlush() call just before the glFinish() in AfterGLCall()
3. Run a simple test page (presumably this doesn't matter much; here's mine https://pastebin.mozilla.org/6606587 )
4. Wait for a few frames of the test case to render and then attach layerscope

Almost immediately there's a crash here

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 2099.2270]
mem_unlock_unretain_release_multiple (handles=0xffff0107, n=4259246567) at brcm_usrlib/dag/vmcsx/vcfw/rtos/common/rtos_common_mem.cpp:707
707 header->unlock();
#0 mem_unlock_unretain_release_multiple (handles=0xffff0107, n=4259246567) at brcm_usrlib/dag/vmcsx/vcfw/rtos/common/rtos_common_mem.cpp:707
#1 0xae9a9414 in do_fix_unlock (reason=<value optimized out>, data_=0xaef127d8, specials=0xb3466bec) at brcm_usrlib/dag/vmcsx/middleware/khronos/common/2708/khrn_fmem_4.c:443
#2 alloc_callback (reason=<value optimized out>, data_=0xaef127d8, specials=0xb3466bec) at brcm_usrlib/dag/vmcsx/middleware/khronos/common/2708/khrn_fmem_4.c:322
#3 0xae9aaa4c in khrn_hw_wait () at brcm_usrlib/dag/vmcsx/middleware/khronos/common/2708/khrn_prod_4.c:1202
#4 0xaea4ab84 in glFinish_impl () at brcm_usrlib/dag/vmcsx/middleware/khronos/glxx/glxx_server.c:1638
#5 0xae9a30c4 in glFinish () at brcm_usrlib/dag/vmcsx/interface/khronos/glxx/glxx_client.c:1420
#6 0xb4821c30 in mozilla::gl::GLContext::AfterGLCall (this=0xb6c22000, glFunction=0xb62978dc "void mozilla::gl::GLContext::raw_fClear(GLbitfield)") at /home/cjones/mozilla/b2g/gecko/gfx/gl/GLContext.h:728
#7 0xb48225d8 in mozilla::gl::GLContext::raw_fClear (this=0xb6c22000, mask=16384) at /home/cjones/mozilla/b2g/gecko/gfx/gl/GLContext.h:954
#8 mozilla::gl::GLContext::fClear (this=0xb6c22000, mask=16384) at /home/cjones/mozilla/b2g/gecko/gfx/gl/GLContext.h:960
#9 0xb482d7e8 in mozilla::gl::GLReadTexImageHelper::ReadTexImage (this=0xaf62d100, aTextureId=<value optimized out>, aTextureTarget=3553, aSize=..., aConfig=-1266377084, aYInvert=172) at /home/cjones/mozilla/b2g/gecko/gfx/gl/GLReadTexImageHelper.cpp:720
#10 0xb4849e84 in mozilla::layers::SenderHelper::SendTextureSource (aGLContext=0xb6c22000, aLayerRef=0xb0827800, aSource=0xb15e1bd0, aFlipY=<value optimized out>) at /home/cjones/mozilla/b2g/gecko/gfx/layers/LayerScope.cpp:785
[snip]
(gdb) p *header
Cannot access memory at address 0xfddef1e7

This doesn't happen without the glFlush() in AfterGLCall().
I added consistency assertions to the memory manager here and the assertions pass (sorta). The crash still happens.

Working back through the pointer-chasing trail, we get back to a KHRN_FMEM_CALLBACK_DATA_T* that points into a gememalloc mmap'd region. From that gememalloc region we read a KHRN_FMEM_T* box. This memory has been scribbled over with 0xffff00ff, and we end up deref'ing that into garbage and crashing. (Incredibly luckily, 0xffff00ff happens to point at mapped memory, a special ARM page.)

That 0xffff00ff looks extremely suspicious: those are the bits comprising the color rgba(1.0, 0.0, 1.0, 1.0) in GL's ARGB format, i.e. the pink un-rendered region from bug 1072140! So it seems that either

1. The KHRN_FMEM_CALLBACK_DATA_T* used to point at valid memory, but the memory was scribbled over by a QPU job. In this case, a buffer-format mismatch might be a possibility.

or

2. The KHRN_FMEM_CALLBACK_DATA_T* points at memory that was freed, then recycled into a renderbuffer.

First test is to switch to a debug fill color that's not accidentally a valid address!
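To make the "that pointer is really a color" reasoning concrete, here is a minimal sketch of the decoding, assuming the word is packed ARGB with alpha in the top byte. The names Channels, DecodeArgb and FindFillWord are hypothetical helpers written for illustration; they are not part of the driver or of Gecko.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper type: the four 8-bit channels of one packed word.
struct Channels {
    uint8_t a, r, g, b;
};

// Interpret a 32-bit word as packed ARGB (assumption: alpha is the top byte).
constexpr Channels DecodeArgb(uint32_t word) {
    return Channels{
        static_cast<uint8_t>((word >> 24) & 0xff),
        static_cast<uint8_t>((word >> 16) & 0xff),
        static_cast<uint8_t>((word >> 8) & 0xff),
        static_cast<uint8_t>(word & 0xff),
    };
}

// Scan a dump of a structure for the first word equal to the debug fill
// color. Returns the word index, or -1 if the fill pattern does not appear.
inline std::ptrdiff_t FindFillWord(const uint32_t* words, std::size_t count,
                                   uint32_t fill) {
    assert(words != nullptr || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        if (words[i] == fill) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    return -1;
}
```

DecodeArgb(0xffff00ff) gives full alpha, red and blue with zero green, which is the pink fill. Running FindFillWord over a dump of the suspect structure is one way to tell a scribbled field from an ordinary wild pointer.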
With the new debugging fill color, we get a similar crash under an eglFlush(). We end up trying to deref a pointer that's 0xffff0000 (i.e. opaque blue in GL RGBA format), and the pointer is within a structure that's allocated from video memory (for reasons I don't yet understand).

I noticed that the winning code in the GL challenge included its own allocator implementation, for reasons I didn't fully understand. I /thought/ it was to avoid having to implement the gememalloc interface, but it may also have been to fix/work around these corruption bugs. So that's another option, albeit one I'd rather not take (slippery slope).

https://github.com/simonjhall/challenge
> 1. The KHRN_FMEM_CALLBACK_DATA_T* used to point at valid memory, but the memory was scribbled over by a QPU job. In this case, a buffer-format mismatch might be a possibility.

I'm now quite confident that this is the problem.

I spent some time reviewing the allocator (including reimplementing one function I couldn't understand at first) but didn't find any problems that seem related to this. And if I disable freeing this video memory, we of course OOM pretty soon, but the crash is easily reproducible before that. However, if I bump every allocation size up by a factor of 2, then the crash no longer reproduces. (If (2) were the culprit, the opposite would be more likely.)

Let's hope there's just an image-format setting wrong somewhere, or something along those lines ...
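The doubling experiment can be sketched roughly as follows. This is an illustration only: std::malloc stands in for the video memory allocator, the canary fill of the extra half is an addition that was not part of the experiment described above, and PaddedBlock, PaddedAlloc, CountClobberedPadding and PaddedFree are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Assumed canary byte for the padding; any value unlikely to be written
// by a render job would do.
constexpr uint8_t kCanary = 0xA5;

// Hypothetical record of one padded allocation.
struct PaddedBlock {
    void* base;             // start of the caller-visible region
    std::size_t requested;  // size the caller asked for
};

// Allocate twice the requested size and fill the extra half with canaries.
// std::malloc is a stand-in for the real video memory allocator.
inline PaddedBlock PaddedAlloc(std::size_t requested) {
    if (requested == 0 || requested > SIZE_MAX / 2) {
        return PaddedBlock{nullptr, 0};
    }
    void* base = std::malloc(requested * 2);
    if (base == nullptr) {
        return PaddedBlock{nullptr, 0};
    }
    std::memset(static_cast<uint8_t*>(base) + requested, kCanary, requested);
    return PaddedBlock{base, requested};
}

// Count padding bytes that no longer hold the canary, i.e. bytes written
// past the end of the region the caller asked for.
inline std::size_t CountClobberedPadding(const PaddedBlock& block) {
    assert(block.base != nullptr || block.requested == 0);
    const uint8_t* pad =
        static_cast<const uint8_t*>(block.base) + block.requested;
    std::size_t clobbered = 0;
    for (std::size_t i = 0; i < block.requested; ++i) {
        if (pad[i] != kCanary) {
            ++clobbered;
        }
    }
    return clobbered;
}

// Release the block and clear the record so it cannot be reused by mistake.
inline void PaddedFree(PaddedBlock& block) {
    std::free(block.base);
    block.base = nullptr;
    block.requested = 0;
}
```

If the padding absorbs the stray writes, the neighbouring structure survives, which matches the crash going away with doubled sizes. A nonzero count from CountClobberedPadding before freeing would point at an overrun by whoever writes into that buffer rather than at a use-after-free.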
WONTFIX'ing because I couldn't resolve the issues in this stack, and there's no long-term support plan for the code.