b2g-process crash due to memory corruption or double-free



5 years ago
4 years ago


(Reporter: cjones, Unassigned)


The crash happens eventually on pretty much any nontrivial page/application.  The crashes almost always happen during or after submitting a GL job.  b2g usually crashes in the GLSL compiler or JS engine.  For a variety of reasons, the GL impl is highly suspect.

I incidentally discovered that disabling the use-mHwc-to-swap-buffers path in |GonkDisplayICS::SwapBuffers()| makes the crashes less frequent.  That suggests to me that something in that path is triggering more of the at-fault code.

It would be so nice if we could make ICS b2g builds with ASan enabled ...
I can reproduce this reliably (3/3 so far) with

 1. Enable GL debugging and layerscope
 2. Add a glFlush() call just before the glFinish() in AfterGLCall()
 3. Run a simple test page (presumably this doesn't matter much; here's mine https://pastebin.mozilla.org/6606587 )
 4. Wait for a few frames of the test case to render and then attach layerscope

Almost immediately there's a crash here

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 2099.2270]
mem_unlock_unretain_release_multiple (handles=0xffff0107, n=4259246567)
    at brcm_usrlib/dag/vmcsx/vcfw/rtos/common/rtos_common_mem.cpp:707
707				header->unlock();
#0  mem_unlock_unretain_release_multiple (handles=0xffff0107, n=4259246567)
    at brcm_usrlib/dag/vmcsx/vcfw/rtos/common/rtos_common_mem.cpp:707
#1  0xae9a9414 in do_fix_unlock (reason=<value optimized out>, 
    data_=0xaef127d8, specials=0xb3466bec)
    at brcm_usrlib/dag/vmcsx/middleware/khronos/common/2708/khrn_fmem_4.c:443
#2  alloc_callback (reason=<value optimized out>, data_=0xaef127d8, 
    at brcm_usrlib/dag/vmcsx/middleware/khronos/common/2708/khrn_fmem_4.c:322
#3  0xae9aaa4c in khrn_hw_wait ()
    at brcm_usrlib/dag/vmcsx/middleware/khronos/common/2708/khrn_prod_4.c:1202
#4  0xaea4ab84 in glFinish_impl ()
    at brcm_usrlib/dag/vmcsx/middleware/khronos/glxx/glxx_server.c:1638
#5  0xae9a30c4 in glFinish ()
    at brcm_usrlib/dag/vmcsx/interface/khronos/glxx/glxx_client.c:1420
#6  0xb4821c30 in mozilla::gl::GLContext::AfterGLCall (this=0xb6c22000, 
    glFunction=0xb62978dc "void mozilla::gl::GLContext::raw_fClear(GLbitfield)") at /home/cjones/mozilla/b2g/gecko/gfx/gl/GLContext.h:728
#7  0xb48225d8 in mozilla::gl::GLContext::raw_fClear (this=0xb6c22000, 
    mask=16384) at /home/cjones/mozilla/b2g/gecko/gfx/gl/GLContext.h:954
#8  mozilla::gl::GLContext::fClear (this=0xb6c22000, mask=16384)
    at /home/cjones/mozilla/b2g/gecko/gfx/gl/GLContext.h:960
#9  0xb482d7e8 in mozilla::gl::GLReadTexImageHelper::ReadTexImage (
    this=0xaf62d100, aTextureId=<value optimized out>, aTextureTarget=3553, 
    aSize=..., aConfig=-1266377084, aYInvert=172)
    at /home/cjones/mozilla/b2g/gecko/gfx/gl/GLReadTexImageHelper.cpp:720
#10 0xb4849e84 in mozilla::layers::SenderHelper::SendTextureSource (
    aGLContext=0xb6c22000, aLayerRef=0xb0827800, aSource=0xb15e1bd0, 
    aFlipY=<value optimized out>)
    at /home/cjones/mozilla/b2g/gecko/gfx/layers/LayerScope.cpp:785
(gdb) p *header
Cannot access memory at address 0xfddef1e7

This doesn't happen without the glFlush() in AfterGLCall().
I added consistency assertions to the memory manager here and the assertions pass (sorta).  The crash still happens.

Working back through the pointer-chasing trail, we get back to a KHRN_FMEM_CALLBACK_DATA_T* that points into a gememalloc mmap'd region.  From that gememalloc region we read a KHRN_FMEM_T* box.  This memory has been scribbled over with 0xffff00ff, and we end up deref'ing that into garbage and crashing.  (Incredibly luckily, 0xffff00ff happens to point at mapped memory, a special ARM page.)

That 0xffff00ff looks extremely suspicious: those are bits comprising the color rgba(1.0, 0.0, 1.0, 1.0) in GL's ARGB format, i.e. the pink un-rendered region from bug 1072140!  So it seems that either

 1. The KHRN_FMEM_CALLBACK_DATA_T* used to point at valid memory, but the memory was scribbled over by a QPU job.  In this case, a buffer-format mismatch might be a possibility.


 2. The KHRN_FMEM_CALLBACK_DATA_T* points at memory that was freed, then recycled into a renderbuffer
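
For reference, the reading of 0xffff00ff as a color can be checked with a few shifts. This is only a minimal sketch, assuming the word is interpreted with alpha in the top byte and blue in the bottom byte (ARGB); the type and function names are made up for illustration and are not part of the driver.

```c
#include <stdint.h>

/* Channels of a 32-bit pixel read as one word in ARGB order
 * (alpha in the top byte, blue in the bottom byte).
 * Hypothetical helper type, not taken from the driver source. */
typedef struct {
    uint8_t a, r, g, b;
} argb8888;

/* Split a 32-bit word into its four 8-bit channels. */
argb8888 decode_argb(uint32_t word) {
    argb8888 px;
    px.a = (uint8_t)(word >> 24);
    px.r = (uint8_t)(word >> 16);
    px.g = (uint8_t)(word >> 8);
    px.b = (uint8_t)word;
    return px;
}
```

Feeding it 0xffff00ff gives a = 0xff, r = 0xff, g = 0x00, b = 0xff, i.e. fully opaque magenta, which matches rgba(1.0, 0.0, 1.0, 1.0).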

First test is to switch to a debug fill color that's not accidentally a valid address!
With the new debugging fill color, we get a similar crash under an eglFlush().  We end up trying to deref a pointer that's 0xffff0000 (i.e. opaque blue in GL RGBA format), and the pointer is within a structure that's allocated from video memory (for reasons I don't yet understand).
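
To spell out why 0xffff0000 reads as opaque blue: a rough sketch, assuming the pixel is stored as four bytes in R, G, B, A order and then loaded as a single 32-bit word by a little-endian CPU. Both the byte order and the helper names here are assumptions for illustration.

```c
#include <stdint.h>

/* One pixel as it sits in memory, R at the lowest address.
 * Hypothetical helper type, not taken from the driver source. */
typedef struct {
    uint8_t r, g, b, a;
} rgba_bytes;

/* The value a little-endian CPU sees when it loads those four
 * bytes as one uint32_t: the lowest-addressed byte (R) ends up
 * in the least significant bits. */
uint32_t rgba_bytes_to_le_word(rgba_bytes px) {
    return (uint32_t)px.r
         | ((uint32_t)px.g << 8)
         | ((uint32_t)px.b << 16)
         | ((uint32_t)px.a << 24);
}
```

With r = 0, g = 0, b = 0xff, a = 0xff this yields 0xffff0000, the bad pointer value seen in the crash.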

I noticed that the winning code in the GL challenge[1] included its own allocator implementation, for reasons I didn't fully understand.  I /thought/ it was to avoid having to implement the gememalloc interface, but it may also have been to fix/work around these corruption bugs.  So that's another option, albeit one I'd rather not take (slippery slope).

[1] https://github.com/simonjhall/challenge
>  1. The KHRN_FMEM_CALLBACK_DATA_T* used to point at valid memory, but the memory was scribbled over by a QPU job.  In this case, a buffer-format mismatch might be a possibility.

I'm now quite confident that this is the problem.  Spent some time reviewing the allocator (including reimplementing one function I couldn't understand at first) but didn't find any problems that seem related to this.  And if I disable freeing this video memory, we of course OOM pretty soon, but the crash is easily repro-able before that.  However, if I bump every allocation size up by a factor of 2, then the crash no longer reproduces.  (If (2) were the culprit, the opposite would be more likely.)

Let's hope there's just an image-format setting wrong somewhere, or something along those lines ...
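
As a purely hypothetical illustration of how a format mismatch would fit the 2x observation: if a buffer were sized for a 16-bit-per-pixel format but written as 32 bits per pixel, the writer would run off the end by exactly one buffer's worth, and doubling the allocation would absorb the overrun. The formats, the 320x480 dimensions, and the function names below are assumptions; nothing in this bug identifies the actual mismatch.

```c
#include <stddef.h>

/* Bytes needed for a tightly packed width x height image. */
size_t image_bytes(size_t width, size_t height, size_t bytes_per_pixel) {
    return width * height * bytes_per_pixel;
}

/* How many bytes a writer runs past the end of an allocation;
 * 0 if everything it writes fits. */
size_t overrun_bytes(size_t allocated, size_t written) {
    return written > allocated ? written - allocated : 0;
}
```

For a 320x480 surface, image_bytes(320, 480, 2) is 307200 and image_bytes(320, 480, 4) is 614400, so the hypothetical overrun is 307200 bytes; with the allocation doubled, overrun_bytes() returns 0 and the scribble stays inside the buffer.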
WONTFIX'ing because I couldn't resolve the issues in this stack, and there's no long-term support plan for the code.
