Closed Bug 840294 Opened 7 years ago Closed 7 years ago

FxOS Desktop debug on m-c crashes in IOThread

Categories

(Firefox OS Graveyard :: General, defect)

x86_64
Linux
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: qdot, Assigned: qdot)

Details

(Keywords: crash, Whiteboard: [b2g-crash])

Attachments

(1 file)

When trying to get stacks for bug 840286, I found that running FxOS Desktop under gdb on current m-c (not b2g-18) crashes in the IOThread. (Stack coming soon.)
Severity: normal → critical
Keywords: crash
Whiteboard: [b2g-crash]
Stack:

###!!! ABORT: file /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_pump_libevent.cc, line 155
UNKNOWN [/share/code/mozbuild/mozilla-central/obj-debug/dist/bin/libxul.so +0x0188E6B3]
UNKNOWN [/share/code/mozbuild/mozilla-central/obj-debug/dist/bin/libxul.so +0x0187F399]
UNKNOWN [/share/code/mozbuild/mozilla-central/obj-debug/dist/bin/libxul.so +0x01880AB7]
UNKNOWN [/share/code/mozbuild/mozilla-central/obj-debug/dist/bin/libxul.so +0x01880BE0]
UNKNOWN [/share/code/mozbuild/mozilla-central/obj-debug/dist/bin/libxul.so +0x0188E0BF]
UNKNOWN [/share/code/mozbuild/mozilla-central/obj-debug/dist/bin/libxul.so +0x0187F55C]
UNKNOWN [/share/code/mozbuild/mozilla-central/obj-debug/dist/bin/libxul.so +0x0187F584]
UNKNOWN [/share/code/mozbuild/mozilla-central/obj-debug/dist/bin/libxul.so +0x01884E35]
UNKNOWN [/share/code/mozbuild/mozilla-central/obj-debug/dist/bin/libxul.so +0x0188EA80]
UNKNOWN [/lib/x86_64-linux-gnu/libpthread.so.0 +0x00007E9A]
clone+0x0000006D [/lib/x86_64-linux-gnu/libc.so.6 +0x000F3CBD]
###!!! ABORT: file /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_pump_libevent.cc, line 155

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe6740700 (LWP 25213)]
mozalloc_abort (msg=<optimized out>) at /share/code/mozbuild/mozilla-central/memory/mozalloc/mozalloc_abort.cpp:30
30          MOZ_CRASH();
(gdb) bt
#0  mozalloc_abort (msg=<optimized out>) at /share/code/mozbuild/mozilla-central/memory/mozalloc/mozalloc_abort.cpp:30
#1  0x00007ffff3a06098 in Abort (aMsg=0x7fffe673f58c "###!!! ABORT: file /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_pump_libevent.cc, line 155") at /share/code/mozbuild/mozilla-central/xpcom/base/nsDebugImpl.cpp:422
#2  NS_DebugBreak_P (aSeverity=<optimized out>, aStr=<optimized out>, aExpr=0x0, aFile=0x7ffff4635ec2 "/share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_pump_libevent.cc", aLine=155) at /share/code/mozbuild/mozilla-central/xpcom/base/nsDebugImpl.cpp:409
#3  0x00007ffff3a2c1f2 in mozilla::Logger::~Logger (this=0x7fffe673fa10, __in_chrg=<optimized out>) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/logging.cc:47
#4  0x00007ffff3a3b6b3 in ~LogWrapper (this=0x7fffe673fa10, __in_chrg=<optimized out>) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/logging.h:59
#5  base::MessagePumpLibevent::WatchFileDescriptor (this=0x7fffe0001000, fd=-1, persistent=false, mode=base::MessagePumpLibevent::WATCH_WRITE, controller=0x127c458, delegate=0x127c420)
    at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_pump_libevent.cc:155
#6  0x00007ffff3a2c399 in MessageLoop::RunTask (this=0x7fffe673fcd8, task=0x1589010) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_loop.cc:333
#7  0x00007ffff3a2dab7 in MessageLoop::DeferOrRunPendingTask (this=<optimized out>, pending_task=...) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_loop.cc:341
#8  0x00007ffff3a2dbe0 in DoWork (this=<optimized out>) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_loop.cc:441
#9  MessageLoop::DoWork (this=0x7fffe673fcd8) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_loop.cc:420
#10 0x00007ffff3a3b0bf in base::MessagePumpLibevent::Run (this=0x7fffe0001000, delegate=0x7fffe673fcd8) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_pump_libevent.cc:311
#11 0x00007ffff3a2c55c in MessageLoop::RunInternal (this=0x7fffe673fcd8) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_loop.cc:215
#12 0x00007ffff3a2c584 in RunHandler (this=0x7fffe673fcd8) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_loop.cc:208
#13 MessageLoop::Run (this=0x7fffe673fcd8) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_loop.cc:182
#14 0x00007ffff3a31e35 in base::Thread::ThreadMain (this=0x6dea40) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/thread.cc:156
#15 0x00007ffff3a3ba80 in ThreadFunc (closure=<optimized out>) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/platform_thread_posix.cc:39
#16 0x00007ffff73c4e9a in start_thread (arg=0x7fffe6740700) at pthread_create.c:308
#17 0x00007ffff70f1cbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112


Most obvious thing from stack: WatchFileDescriptor's fd=-1. That can't be good.
Ok, so it looks like this is RIL related (as I removed --enable-b2g-bt completely). I added a MOZ_ASSERT in a couple of places and got this more helpful stack back:

Assertion failure: aFd >= 0, at /share/code/mozbuild/mozilla-central/ipc/unixsocket/UnixSocket.cpp:703

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffe6740700 (LWP 31683)]
0x00007ffff39b38fe in mozilla::ipc::UnixSocketImpl::OnFileCanWriteWithoutBlocking (this=0x127c160, aFd=-1) at /share/code/mozbuild/mozilla-central/ipc/unixsocket/UnixSocket.cpp:703
703       MOZ_ASSERT(aFd >= 0);
(gdb) bt
#0  0x00007ffff39b38fe in mozilla::ipc::UnixSocketImpl::OnFileCanWriteWithoutBlocking (this=0x127c160, aFd=-1) at /share/code/mozbuild/mozilla-central/ipc/unixsocket/UnixSocket.cpp:703
#1  0x00007ffff3a2c3e1 in MessageLoop::RunTask (this=0x7fffe673fcd8, task=0x1478ff0) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_loop.cc:333
#2  0x00007ffff3a2daff in MessageLoop::DeferOrRunPendingTask (this=<optimized out>, pending_task=...) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_loop.cc:341
#3  0x00007ffff3a2dc28 in DoWork (this=<optimized out>) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_loop.cc:441
#4  MessageLoop::DoWork (this=0x7fffe673fcd8) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_loop.cc:420
#5  0x00007ffff3a3b107 in base::MessagePumpLibevent::Run (this=0x7fffe0001000, delegate=0x7fffe673fcd8) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_pump_libevent.cc:311
#6  0x00007ffff3a2c5a4 in MessageLoop::RunInternal (this=0x7fffe673fcd8) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_loop.cc:215
#7  0x00007ffff3a2c5cc in RunHandler (this=0x7fffe673fcd8) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_loop.cc:208
#8  MessageLoop::Run (this=0x7fffe673fcd8) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/message_loop.cc:182
#9  0x00007ffff3a31e7d in base::Thread::ThreadMain (this=0x6dea40) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/thread.cc:156
#10 0x00007ffff3a3bac8 in ThreadFunc (closure=<optimized out>) at /share/code/mozbuild/mozilla-central/ipc/chromium/src/base/platform_thread_posix.cc:39
#11 0x00007ffff73c4e9a in start_thread (arg=0x7fffe6740700) at pthread_create.c:308
#12 0x00007ffff70f1cbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#13 0x0000000000000000 in ?? ()

So RIL is somehow sending data even though it has no open socket to send it on.

Also, this crash happens all the time, not just in gdb. It was just flying by too fast for me to see it in the non-gdb logs.
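
To illustrate the kind of early assert mentioned above, here is a minimal sketch. It assumes a stand-in type (`SketchSocketImpl` is hypothetical and is not the real UnixSocketImpl) and uses the standard `assert` rather than MOZ_ASSERT so it builds on its own. The point is only that checking the descriptor at the callback entry reports the failure closer to its source than the abort inside the message pump does.

```cpp
#include <cassert>

// Hypothetical stand-in for a socket I/O handler; not the real UnixSocketImpl.
struct SketchSocketImpl {
  int writableEvents = 0;

  // Invoked by the event loop when `fd` can be written without blocking.
  void OnFileCanWriteWithoutBlocking(int fd) {
    // Fail here, at the callback entry, instead of deeper in the
    // event-loop code where the cause is harder to see.
    assert(fd >= 0);
    ++writableEvents;
  }
};
```

In a debug build, a call with fd=-1 stops at the assert; a call with a valid descriptor simply counts the event.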
Summary: FxOS Desktop debug on m-c crashes in gdb → FxOS Desktop debug on m-c crashes in IOThread
Duplicate of this bug: 840286
This patch adds a check to make sure we aren't trying to send data to a non-opened RIL socket. It also adds a couple of asserts to blow up earlier if this happens elsewhere.

Feel free to r- this if you feel we should be doing this at a lower level than where I'm putting the check; I can try to figure out some way to do it in UnixSocket too. I mainly want to get the review process started now, since I'm on PTO tomorrow.
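
As a rough illustration of the check described above (a sketch with stand-in types, not the attached patch): `SketchConsumer`, `SocketStatus`, and `SendToClient` are hypothetical names invented for this example. The idea is to refuse the send unless the client slot exists, is populated, and reports a connected socket.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins; these are not the Gecko types.
enum class SocketStatus { Disconnected, Connecting, Connected };

struct SketchConsumer {
  SocketStatus status = SocketStatus::Disconnected;
  std::vector<std::string> sent;

  SocketStatus GetConnectionStatus() const { return status; }
  void Send(const std::string& data) { sent.push_back(data); }
};

// Returns false, and sends nothing, unless the client exists and is connected.
bool SendToClient(std::vector<std::unique_ptr<SketchConsumer>>& consumers,
                  std::size_t clientId, const std::string& data) {
  if (consumers.size() <= clientId ||  // unknown client id
      !consumers[clientId] ||          // slot was never populated
      consumers[clientId]->GetConnectionStatus() != SocketStatus::Connected) {
    return false;
  }
  consumers[clientId]->Send(data);
  return true;
}
```

Checking at the call site keeps the socket layer unchanged; doing the same check lower down, in the socket class itself, would cover every caller at once.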
Attachment #712725 - Flags: review?(vyang)
It's similar to bug 805754.
Crash Signature: [@ mozalloc_abort(char const*) | NS_DebugBreak_P | mozilla::Logger::~Logger ]
Agreeing with comment 34 on bug 805754. Not similar.
Returning to normal. This is not critical since I don't think it'll happen on b2g-18.
Severity: critical → normal
Comment on attachment 712725
Patch 1 (v1) - Check RIL validity before writing to socket

Thomas, since vicamo is out, can you take a look at this? The same idea applies: if you think it should be moved to UnixSocket, just r- and let me know.
Attachment #712725 - Flags: review?(vyang) → review?(tzimmermann)
Comment on attachment 712725
Patch 1 (v1) - Check RIL validity before writing to socket

Review of attachment 712725:
-----------------------------------------------------------------

Looks good.

::: dom/system/gonk/SystemWorkerManager.cpp
@@ +449,5 @@
>                                      UnixSocketRawData* aRaw)
>  {
>    if ((gInstance->mRilConsumers.Length() <= aClientId) ||
> +      !gInstance->mRilConsumers[aClientId] ||
> +      gInstance->mRilConsumers[aClientId]->GetConnectionStatus() != SOCKET_CONNECTED) {

Just some nitpicking: could it happen that the socket is in the state SOCKET_CONNECTING when this line gets executed? My impression is that we should be able to send data (as in: add it to the send queue) in this case.
Attachment #712725 - Flags: review?(tzimmermann) → review+
Yeah, at that point we should be ok to add to the queue since we'll be CONNECTED by the time the task ends. I'll add that, update patch, and land.

(In reply to Thomas Zimmermann [:tzimmermann] from comment #9)
> Comment on attachment 712725
> Patch 1 (v1) - Check RIL validity before writing to socket
> 
> Review of attachment 712725:
> -----------------------------------------------------------------
> 
> Looks good.
> 
> ::: dom/system/gonk/SystemWorkerManager.cpp
> @@ +449,5 @@
> >                                      UnixSocketRawData* aRaw)
> >  {
> >    if ((gInstance->mRilConsumers.Length() <= aClientId) ||
> > +      !gInstance->mRilConsumers[aClientId] ||
> > +      gInstance->mRilConsumers[aClientId]->GetConnectionStatus() != SOCKET_CONNECTED) {
> 
> Just some nitpicking: could it happen that the socket is in the state
> SOCKET_CONNECTING when this line gets executed? My impression is that we
> should be able to send data (as in: add it to the send queue) in this case.
Actually, I'm going to leave it as is for the moment. CONNECTING does not imply the connection will be successful. Sending in that state queues work via tasks, which we'd have to cancel on failure, and we'd also have to queue later destruction of the UnixSocketImpl (a followup I'll be filing here in a sec). We might as well just wait until we're actually connected.
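
To make the tradeoff above concrete, here is a small sketch of the alternative that accepts writes while still connecting. `QueueingSocket` and `SketchStatus` are hypothetical names for this example only and do not model the real UnixSocket task machinery. It shows the extra cleanup that is needed when a connection attempt fails after data was already queued.

```cpp
#include <cstddef>
#include <deque>
#include <string>

enum class SketchStatus { Disconnected, Connecting, Connected };

// Hypothetical socket model that accepts writes while still connecting.
class QueueingSocket {
 public:
  void BeginConnect() { status_ = SketchStatus::Connecting; }

  // Accepts data in the Connecting or Connected state.
  bool Send(const std::string& data) {
    if (status_ == SketchStatus::Disconnected) {
      return false;
    }
    pending_.push_back(data);
    return true;
  }

  // The connection attempt has finished. On failure, writes queued while
  // connecting have to be discarded.
  void FinishConnect(bool ok) {
    if (ok) {
      status_ = SketchStatus::Connected;
    } else {
      status_ = SketchStatus::Disconnected;
      pending_.clear();  // cleanup that a connected-only check never needs
    }
  }

  std::size_t PendingCount() const { return pending_.size(); }

 private:
  SketchStatus status_ = SketchStatus::Disconnected;
  std::deque<std::string> pending_;
};
```

With a connected-only check, nothing is ever queued before the connection succeeds, so the failure path has nothing to undo.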
No longer depends on: 841925
https://hg.mozilla.org/mozilla-central/rev/87ac03700d5d
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED