Closed Bug 831313 Opened 9 years ago Closed 9 years ago

Scroll performance testcase with moz-transform causes crash and reboot of Otoro device

Categories

(Firefox OS Graveyard :: General, defect)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

(blocking-b2g:tef+, firefox20 wontfix, firefox21 wontfix, firefox22 fixed, b2g18+ fixed, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 fixed)

RESOLVED FIXED
B2G C4 (2jan on)
blocking-b2g tef+
Tracking Status
firefox20 --- wontfix
firefox21 --- wontfix
firefox22 --- fixed
b2g18 + fixed
b2g18-v1.0.0 --- wontfix
b2g18-v1.0.1 --- fixed

People

(Reporter: martijn.martijn, Assigned: mattwoodrow)

Details

(Keywords: crash, testcase, Whiteboard: [b2g-crash] QARegressExclude)

Crash Data

Attachments

(3 files)

Perhaps the same as/related to bug 820175.

Steps to reproduce:
- Go to testcase url
- Tap on the "Scroll with moztransform test" link
- Make some pinch zooming movements in and out

Result: crash on my Otoro device
blocking-b2g: --- → tef?
Whiteboard: [b2g-crash]
I've CC'd a few people who could perhaps help us figure out what is causing this.

We discussed this during triage today and while we really don't want to have this type of crasher, the test case is enough of a stress test that we don't think it's something we should block on.
blocking-b2g: tef? → -
tracking-b2g18: --- → +
Can you please provide logcat of this crash?
We had another bug with this same testcase, right?
backtrace from gdb in the b2g process:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 824.856]
jemalloc_crash () at /home/cervantes/hg/mozilla-central/memory/mozjemalloc/jemalloc.c:1582
1582            MOZ_CRASH();
(gdb) bt
#0  jemalloc_crash () at /home/cervantes/hg/mozilla-central/memory/mozjemalloc/jemalloc.c:1582
#1  0x4002b954 in arena_run_reg_dalloc (ptr=<value optimized out>, offset=<value optimized out>) at /home/cervantes/hg/mozilla-central/memory/mozjemalloc/jemalloc.c:3329
#2  arena_dalloc_small (ptr=<value optimized out>, offset=<value optimized out>) at /home/cervantes/hg/mozilla-central/memory/mozjemalloc/jemalloc.c:4540
#3  arena_dalloc (ptr=<value optimized out>, offset=<value optimized out>) at /home/cervantes/hg/mozilla-central/memory/mozjemalloc/jemalloc.c:4668
#4  0x4002cd9a in free (ptr=0x48cf4340) at /home/cervantes/hg/mozilla-central/memory/mozjemalloc/jemalloc.c:6589
#5  0x40026c98 in _ZdlPv (ptr=0x48cf4340) at /home/cervantes/hg/mozilla-central/memory/build/mozmemory_wrap.c:62
#6  0x42b4f540 in gralloc::gpu_context_t::free_impl (this=<value optimized out>, hnd=0x48cf4340) at hardware/qcom/display/libgralloc/gpu.cpp:285
#7  0x42b4f8c6 in gralloc::gpu_context_t::alloc_impl (this=0x4a9fdb70, w=64, h=6, format=1, usage=307, pHandle=0x48cf4340, pStride=0x48cf432c, bufferSize=0) at hardware/qcom/display/libgralloc/gpu.cpp:256
#8  0x42b4f926 in gralloc::gpu_context_t::gralloc_alloc (dev=0x48cf4340, w=<value optimized out>, h=<value optimized out>, format=<value optimized out>, usage=307, pHandle=0x48cf4340, pStride=0x48cf432c) at hardware/qcom/display/libgralloc/gpu.cpp:296
#9  0x402ba2e6 in android::GraphicBufferAllocator::alloc (this=<value optimized out>, w=64, h=6, format=1, usage=307, handle=0x48cf4340, stride=0x48cf432c) at frameworks/base/libs/ui/GraphicBufferAllocator.cpp:102
#10 0x402b9c62 in android::GraphicBuffer::initSize (this=0x48cf4300, w=64, h=6, format=1, reqUsage=307) at frameworks/base/libs/ui/GraphicBuffer.cpp:149
#11 0x402b9fd6 in GraphicBuffer (this=0x48cf4300, w=64, h=6, reqFormat=1, reqUsage=307) at frameworks/base/libs/ui/GraphicBuffer.cpp:62
#12 0x418ed824 in mozilla::layers::GrallocBufferActor::Create (aSize=..., aContent=@0x46cff7b0, aOutHandle=0x46cff794) at /home/cervantes/hg/mozilla-central/gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp:208
#13 0x418eb4c6 in mozilla::layers::ShadowLayersParent::AllocPGrallocBuffer (this=<value optimized out>, aSize=<value optimized out>, aContent=<value optimized out>, aOutHandle=0x1) at /home/cervantes/hg/mozilla-central/gfx/layers/ipc/ShadowLayersParent.cpp:500
#14 0x41640a4e in mozilla::layers::PLayersParent::OnMessageReceived (this=0x477b7f00, __msg=<value optimized out>, __reply=@0x46cffc0c) at /home/cervantes/git/b2g-device2/B2G/objdir-gecko-dbg/ipc/ipdl/PLayersParent.cpp:452
#15 0x416350a2 in mozilla::layers::PCompositorParent::OnMessageReceived (this=0x437bc6f0, __msg=..., __reply=@0x46cffc0c) at /home/cervantes/git/b2g-device2/B2G/objdir-gecko-dbg/ipc/ipdl/PCompositorParent.cpp:411
#16 0x415e06a4 in mozilla::ipc::SyncChannel::OnDispatchMessage (this=0x437bc6f8, msg=...) at /home/cervantes/hg/mozilla-central/ipc/glue/SyncChannel.cpp:145
#17 0x415de10a in mozilla::ipc::RPCChannel::OnMaybeDequeueOne (this=0x437bc6f8) at /home/cervantes/hg/mozilla-central/ipc/glue/RPCChannel.cpp:400
#18 0x415ad8cc in DispatchToMethod<mozilla::dom::ContentParent, void (mozilla::dom::ContentParent::*)()> (this=<value optimized out>) at /home/cervantes/hg/mozilla-central/ipc/chromium/src/base/tuple.h:383
#19 RunnableMethod<mozilla::dom::ContentParent, void (mozilla::dom::ContentParent::*)(), Tuple0>::Run (this=<value optimized out>) at /home/cervantes/hg/mozilla-central/ipc/chromium/src/base/task.h:307
#20 0x415dc566 in mozilla::ipc::RPCChannel::RefCountedTask::Run (this=0x4779daa0) at ../../dist/include/mozilla/ipc/RPCChannel.h:425
#21 mozilla::ipc::RPCChannel::DequeueTask::Run (this=0x4779daa0) at ../../dist/include/mozilla/ipc/RPCChannel.h:448
#22 0x4185da62 in MessageLoop::RunTask (this=0x46cffdd0, task=0x4779daa0) at /home/cervantes/hg/mozilla-central/ipc/chromium/src/base/message_loop.cc:333
#23 0x4185e28c in MessageLoop::DeferOrRunPendingTask (this=0x133, pending_task=<value optimized out>) at /home/cervantes/hg/mozilla-central/ipc/chromium/src/base/message_loop.cc:341
#24 0x4185efde in MessageLoop::DoWork (this=0x46cffdd0) at /home/cervantes/hg/mozilla-central/ipc/chromium/src/base/message_loop.cc:441
#25 0x4185f35a in base::MessagePumpDefault::Run (this=0x4603d880, delegate=0x46cffdd0) at /home/cervantes/hg/mozilla-central/ipc/chromium/src/base/message_pump_default.cc:23
#26 0x4185e016 in MessageLoop::RunInternal (this=0x46cffdd0) at /home/cervantes/hg/mozilla-central/ipc/chromium/src/base/message_loop.cc:215
#27 0x4185e076 in MessageLoop::RunHandler (this=0x46cffdd0) at /home/cervantes/hg/mozilla-central/ipc/chromium/src/base/message_loop.cc:208
#28 MessageLoop::Run (this=0x46cffdd0) at /home/cervantes/hg/mozilla-central/ipc/chromium/src/base/message_loop.cc:182
#29 0x41867fdc in base::Thread::ThreadMain (this=0x46049f40) at /home/cervantes/hg/mozilla-central/ipc/chromium/src/base/thread.cc:156
#30 0x41875ba2 in ThreadFunc (closure=0x133) at /home/cervantes/hg/mozilla-central/ipc/chromium/src/base/platform_thread_posix.cc:39
#31 0x4005be18 in __thread_entry (func=0x41875b99 <ThreadFunc>, arg=0x46049f40, tls=<value optimized out>) at bionic/libc/bionic/pthread.c:217
#32 0x4005b96c in pthread_create (thread_out=<value optimized out>, attr=0xbee45238, start_routine=0x41875b99 <ThreadFunc>, arg=0x46049f40) at bionic/libc/bionic/pthread.c:357
#33 0x4a30fd20 in ?? ()
Cannot access memory at address 0x0
#34 0x4a30fd20 in ?? ()
Cannot access memory at address 0x0
Crash Signature: [@ jemalloc_crash | arena_run_reg_dalloc | arena_dalloc_small | arena_dalloc | free | _ZdlPv]
That's a very interesting crash stack, but I would also very much appreciate logcat here.  (In fact, I would appreciate if /all/ crashes reported by QA came with logcat and gdb stacks, where feasible.)
The jemalloc assertion indicates that we're freeing an interior pointer or doing some other badness.

This looks to me like it may be a bug in the qcom driver.

To wit, hardware/qcom/display/libgralloc/gpu.cpp's gpu_context_t::alloc_impl does:

    err = genlock_create_lock((native_handle_t*)(*pHandle));
    if (err) {
        LOGE("%s: genlock_create_lock failed", __FUNCTION__);
        free_impl(reinterpret_cast<private_handle_t*>(pHandle));
        return err;
    }

free_impl then eventually does |delete hnd|.  The delete is what's causing us to crash.

This is sketchy to me because it seems that frameworks/base/libs/ui/GraphicBuffer.cpp owns |handle|, not gpu_context_t.  Also, gpu_context_t's free_impl doesn't look anything like GraphicBuffer's free_handle.

I don't see where it's allocated exactly, but my guess would be that handle isn't malloc()'ed, or is an interior pointer into some larger malloc()'ed block.
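
The addresses in the backtrace are consistent with the interior-pointer guess: free() receives ptr=0x48cf4340, while the GraphicBuffer being constructed is at this=0x48cf4300, so the freed address is 0x40 bytes into that object. A minimal sketch of what "interior pointer" means here, assuming a made-up struct layout (FakeGraphicBuffer and both helper functions are hypothetical and are not the real android::GraphicBuffer):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical layout for illustration only; NOT the real android::GraphicBuffer.
struct FakeGraphicBuffer {
    char header[0x40];   // padding chosen to mirror the 0x40 offset in the backtrace
    const void* handle;  // slot that the allocator fills through an out-parameter
};

// Offset of the handle slot from the start of the owning object.
std::size_t handle_slot_offset() {
    return offsetof(FakeGraphicBuffer, handle);
}

// True when p lies strictly inside *obj, i.e. p is an interior pointer that
// was never returned by malloc()/new and must not be passed to free()/delete.
bool is_interior_pointer(const FakeGraphicBuffer* obj, const void* p) {
    const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(obj);
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    return addr > base && addr < base + sizeof(FakeGraphicBuffer);
}

// Usage: the address of the slot is interior; the object itself is not.
void check_example() {
    FakeGraphicBuffer* buf = new FakeGraphicBuffer();
    assert(handle_slot_offset() == 0x40);
    assert(is_interior_pointer(buf, &buf->handle));
    assert(!is_interior_pointer(buf, buf));
    delete buf;
}
```

Handing such an address to free() asks the allocator to release a block it never returned, which is the kind of misuse the jemalloc assertion catches.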
Attached file catlog
This is the logcat output, but I doubt it is useful. I didn't get crash stack IDs when b2g rebooted.

Occasionally the content process in the browser itself crashes instead; this is the crash stack ID for that:
https://crash-stats.mozilla.com/report/index/d0124680-a215-408c-9922-3ea8b2130117
Renoming: This looks quite bad, and could very well be exploitable.
blocking-b2g: - → tef?
Let's block on this to at least investigate.

Milan, can you help find an assignee?
Assignee: nobody → milan
blocking-b2g: tef? → tef+
Benoit, could you take a look?
Assignee: milan → bjacob
Actually, Benoit is chasing a big regression in 18 - Jeff, can you take a quick look?
Assignee: bjacob → jmuizelaar
Michael, can you get someone from qualcomm to comment on what's going on here?
Flags: needinfo?(mvines)
This bug comes from confusing a pointer with a pointer to a pointer in gpu.cpp. gralloc_alloc_framebuffer_locked() and gralloc_alloc_buffer() both do the following (taking gralloc_alloc_framebuffer_locked() as the example):

line 116:
    private_handle_t* hnd = new private_handle_t(...);
and then by the end:
    *pHandle = hnd;
where pHandle is of type buffer_handle_t*

Then back to the caller gpu_context_t::alloc_impl():
line 242:
    err = gralloc_alloc_framebuffer(size, usage, pHandle);
...
    err = genlock_create_lock((native_handle_t*)(*pHandle));
    if (err) {
        LOGE("%s: genlock_create_lock failed", __FUNCTION__);
        free_impl(reinterpret_cast<private_handle_t*>(pHandle));
        return err;
    }

It looks like gralloc_alloc_framebuffer_locked() is meant to update the pointer held by gpu_context_t::alloc_impl(), so that if genlock_create_lock() fails (for example, because too many file descriptors are open), alloc_impl() frees the private_handle_t instance allocated in gralloc_alloc_framebuffer_locked(). In fact it doesn't, because pHandle still points to the value passed in. We crash 100% of the time whenever genlock_create_lock() returns a non-zero value. We need Qualcomm to fix this bug.
(In reply to Cervantes Yu from comment #13)
>     err = genlock_create_lock((native_handle_t*)(*pHandle));
>     if (err) {
>         LOGE("%s: genlock_create_lock failed", __FUNCTION__);
>         free_impl(reinterpret_cast<private_handle_t*>(pHandle));
I was wrong in comment #13: buffer_handle_t is actually a pointer type. The problem is simpler: pHandle is a pointer to a pointer, so the code should pass *pHandle instead of pHandle to free_impl().
>         return err;
>     }
>
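
To illustrate the suggestion, here is a minimal, self-contained sketch of the failure path. Every type and function in it is a simplified, hypothetical stand-in that only borrows names from the snippets quoted in this bug (the real private_handle_t, genlock_create_lock() and free_impl() in libgralloc differ), and resetting *pHandle to null is an extra assumption that is not part of the suggested change.

```cpp
#include <cassert>

// Simplified, hypothetical stand-ins for the gralloc types (NOT the real definitions).
struct private_handle_t {
    int fd;
    explicit private_handle_t(int f) : fd(f) {}
};
typedef const private_handle_t* buffer_handle_t;  // the handle is itself a pointer

static int g_live_handles = 0;  // handles allocated and not yet freed

// Stand-in for gralloc_alloc_buffer(): allocates a handle and returns it
// through the out-parameter.
static int alloc_buffer(buffer_handle_t* pHandle) {
    *pHandle = new private_handle_t(3);
    ++g_live_handles;
    return 0;
}

// Stand-in for free_impl(): expects the handle itself.
static int free_impl(const private_handle_t* hnd) {
    delete hnd;
    --g_live_handles;
    return 0;
}

// Stand-in for genlock_create_lock(); `fail` simulates resource exhaustion.
static int create_lock(buffer_handle_t /*hnd*/, bool fail) {
    return fail ? -1 : 0;
}

static int alloc_impl(buffer_handle_t* pHandle, bool failLock) {
    int err = alloc_buffer(pHandle);
    if (err) return err;
    err = create_lock(*pHandle, failLock);
    if (err) {
        // The crashing code passed pHandle (the address of the caller's slot):
        //   free_impl(reinterpret_cast<private_handle_t*>(pHandle));
        free_impl(*pHandle);  // suggested fix: free the object that was allocated
        *pHandle = nullptr;   // extra hardening assumed here, not part of the suggestion
        return err;
    }
    return 0;
}
```

With the dereference in place, a failing lock releases exactly the handle that was just allocated and the caller's own storage is never handed to delete.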
Flags: needinfo?(mvines) → needinfo?(dwilson)
Any news here?
I can reproduce a crash on my Otoro with the shared URL. Next I'll try the suggested patch.
Flags: needinfo?(dwilson)
Let us know how it goes with the suggested patch.
-> diego for now
Assignee: jmuizelaar → dwilson
(In reply to Justin Lebar [:jlebar] from comment #8)
> Renoming: This looks quite bad, and could very well be exploitable.

The only reason we're tef+ blocking at this point is because this is believed to be exploitable. Is that still the case?
Misuse of free() is among the most dangerous things one can do.
How are things going with this one, Diego?
Flags: needinfo?(dwilson)
(In reply to Justin Lebar [:jlebar] from comment #20)
> Misuse of free() is among the most dangerous things one can do.

Given that, this should be a blocker for both QC and Mozilla.
NPOTB for now, since we don't think this is a problem in our source tree.
Whiteboard: [b2g-crash] → [b2g-crash][NPOTB]
Whiteboard: [b2g-crash][NPOTB] → [b2g-crash][NPOTB][target 28/2]
Removed [target 28/2] because [NPOTB]
Whiteboard: [b2g-crash][NPOTB][target 28/2] → [b2g-crash][NPOTB]
I applied the suggested patch from comment 14, but the test URL still crashes.

Next I'll check whether it happens in the same place or further along.
Flags: needinfo?(dwilson)
(In reply to Diego Wilson [:diego] from comment #25)
> I applied the suggested patch in  Comment 14 but there's still a crash in
> the test url.
> 
That's expected, because the crash results from resource exhaustion in the graphics driver. Even if we don't crash here, we are not likely to get much further. The point is that we no longer crash because of a misuse of free(), which is dangerous.
Looks like the libgralloc patch does solve the "free()" crash. Now I mostly get the "well this is embarrassing :(" browser page, which I think is what we always want in out-of-memory conditions.

I'll send the patch on its way to CAF.
That being said, I do see a gecko crash sometimes when the ThebesLayer is trying to paint (crash stack attached).

Is there a more graceful way of handling this?
That's the content process crashing right? Don't you get the "well this is embarrassing" browser page for that crash?
Gecko restarts. It's a ShadowThebesLayer crash so I'm guessing it's on the main process.
It's mozilla::layers::BasicShadowableThebesLayer::CreateBuffer, so I'm guessing it's the content process :-)
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #32)
> It's mozilla::layers::BasicShadowableThebesLayer::CreateBuffer, so I'm
> guessing it's the content process :-)

Is bug 834372 a dupe of this bug then?
(In reply to Jason Smith [:jsmith] from comment #33)
> Is bug 834372 a dupe of this bug then?

Seems related but not quite the same. The dimensions in bug 834372 look invalid. The dimensions in this bug are valid:

#2  NS_DebugBreak_P (aSeverity=<value optimized out>, aStr=0xbe872d58 "creating ThebesLayer 'back buffer' failed! width=320, height=465, type=1000", aExpr=<value optimized out>, 
    aFile=<value optimized out>, aLine=460) at /local/mnt/workspace/dwilson/ztecdr/gecko/xpcom/base/nsDebugImpl.cpp:380
The libgralloc gpu.cpp patch has been released here:

https://www.codeaurora.org/gitweb/quic/lf/?p=b2g/build.git;a=commit;h=cee98d7cbfd59c1c4b4379c4b094aebc0d601c82

It should be included in release AU_LINUX_GECKO_ICS_STRAWBERRY_V1.01.00.01.19.030 and later.
Unless you want to track the ThebesLayer issue here, we can close this bug now.
(clearing NPOTB and Diego as assignee, for the ThebesLayer issue that remains in this bug)
Assignee: dwilson → nobody
Whiteboard: [b2g-crash][NPOTB] → [b2g-crash]
Since this is tef+ we'll need an assignee - starting with Roc for delegation.
Assignee: nobody → roc
I think we're not handling OOM well, or something like that.
Assignee: roc → matt.woodrow
It could be the main process (UI) that is creating Shadowable layers to send to the compositor.

This doesn't look like a particularly big allocation, so if it is OOM, then we're likely to have issues in other places too.

We can avoid this particular crash fairly easily, but it might result in some fairly broken rendering. Not sure if that's a big improvement.
How are things going here?
As I said before, the best this will do is replace a crash with broken rendering.
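
A minimal sketch of that trade-off, assuming made-up names throughout (BackBuffer, CreateBackBuffer, PaintLayer and PaintResult are hypothetical and are not Gecko's layers API, nor the patch attached to this bug): when the back buffer cannot be created, the paint is skipped instead of the process aborting.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// All names below are hypothetical; this is not Gecko's layers API.
struct BackBuffer {
    int width;
    int height;
    std::vector<unsigned char> pixels;  // 4 bytes per pixel
    BackBuffer(int w, int h)
        : width(w), height(h), pixels(static_cast<std::size_t>(w) * h * 4) {}
};

// Returns null on failure. `simulateFailure` stands in for the graphics
// allocator running out of resources.
std::unique_ptr<BackBuffer> CreateBackBuffer(int w, int h, bool simulateFailure) {
    if (simulateFailure || w <= 0 || h <= 0) {
        return nullptr;
    }
    return std::unique_ptr<BackBuffer>(new BackBuffer(w, h));
}

enum class PaintResult { Painted, SkippedNoBuffer };

PaintResult PaintLayer(int w, int h, bool simulateFailure) {
    std::unique_ptr<BackBuffer> buf = CreateBackBuffer(w, h, simulateFailure);
    if (!buf) {
        // No buffer: skip painting this frame instead of aborting the process.
        // The layer stays blank or stale, so rendering is visibly broken.
        return PaintResult::SkippedNoBuffer;
    }
    std::fill(buf->pixels.begin(), buf->pixels.end(),
              static_cast<unsigned char>(0xFF));
    return PaintResult::Painted;
}
```

For the 320x465 buffer from the debug message above, PaintLayer(320, 465, true) returns PaintResult::SkippedNoBuffer: the process survives, but nothing is drawn for that layer.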
Attachment #727440 - Flags: review?(roc)
https://hg.mozilla.org/mozilla-central/rev/11d3fabf5b4a
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [b2g-crash] → [b2g-crash] QARegressExclude
No test case needs to be created in MozTrap for this issue.
Flags: in-moztrap-
Cannot verify, need steps to blackbox test this issue.