crash in mozilla::ipc::SocketReceiveTask::Run shutting off Bluetooth file transfer in process

RESOLVED FIXED in Firefox 21

Status

defect
--
critical
RESOLVED FIXED
7 years ago
7 years ago

People

(Reporter: marcia, Assigned: tzimmermann)

Tracking

({crash})

unspecified
B2G C4 (2jan on)
ARM
Gonk (Firefox OS)
Dependency tree / graph

Firefox Tracking Flags

(firefox20 wontfix, firefox21 fixed, b2g18+ affected, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 wontfix)

Details

(Whiteboard: [b2g-crash], crash signature)

Attachments

(4 attachments, 12 obsolete attachments)

2.52 KB, patch
tzimmermann
: review+
Details | Diff | Splinter Review
8.34 KB, patch
tzimmermann
: review+
Details | Diff | Splinter Review
5.04 KB, patch
tzimmermann
: review+
Details | Diff | Splinter Review
8.23 KB, patch
tzimmermann
: review+
Details | Diff | Splinter Review
This bug was filed from the Socorro interface and is 
report bp-b0854aeb-0d9f-4030-9b29-2836f2130130 .
============================================================= 

unagi, seen while running:

Gecko revision="066b9d7cf30884a001db22bde3ae939c02718062"
Gaia revision=f7f5a0cd17e3d04308cc5850b254947e127122b9

STR:
1. Start a bluetooth file transfer - I was transferring a 25.3 MB music file
2. Stop the transfer by turning off bluetooth on the unagi device.
3. Receive a crash.
Frame 	Module 	Signature 	Source
0 	libxul.so 	mozilla::ipc::SocketReceiveTask::Run 	UnixSocket.cpp:333
1 	libxul.so 	nsThread::ProcessNextEvent 	nsThread.cpp:620
2 	libxul.so 	NS_ProcessNextEvent_P 	nsThreadUtils.cpp:237
3 	libxul.so 	mozilla::ipc::MessagePump::Run 	MessagePump.cpp:82
4 	libxul.so 	MessageLoop::RunInternal 	message_loop.cc:215
5 	libxul.so 	MessageLoop::Run 	message_loop.cc:208
6 	libxul.so 	nsBaseAppShell::Run 	nsBaseAppShell.cpp:163
7 	libxul.so 	nsAppStartup::Run 	nsAppStartup.cpp:290
8 	libxul.so 	XREMain::XRE_mainRun 	nsAppRunner.cpp:3794
9 	libxul.so 	XREMain::XRE_main 	nsAppRunner.cpp:3860
10 	libxul.so 	XRE_main 	nsAppRunner.cpp:3935
11 	b2g 	main 	nsBrowserApp.cpp:164
12 	libc.so 	__libc_init 	libc_init_dynamic.c:114
13 	libc.so 	__cxa_atexit 	atexit.c:99
14 		@0xbea62d45
blocking-b2g: --- → tef?
OS: Android → Gonk (Firefox OS)
Hardware: All → ARM
Whiteboard: [b2g-crash]
Using the QC RIL when this crash occurred.
can we get someone to confirm this reproduces against moz RIL?  if not, then we can send this onward to QC.
hmm..i don't see us in the .so list
I'm using today's Unagi engineering build (w/Marionette enabled).

Sorry, but I understand time is of the essence, here, so I don't have a crash ID right now.

My git commit info is: 2013-01-30 09:18:04
f7fa0cd17e3d04308cc5850b254947e...

Build identifier is: 20130130063101
I tried reproducing it on a few other devices but haven't had any luck hitting it again.
This looks a lot like bug 830704. Adding tzimmerman to cc just in case this does turn out to be repro'able.
I cannot reproduce the problem, but from looking at the code, I have an idea what might happen.

The fix for bug 830704 ensures that the instance of UnixSocketImpl gets deleted after the last pending SocketReceiveTask. I assumed that in ipc/unixsocket/UnixSocket.cpp, line 626, the IO thread would have been stopped. That seems not to be the case. If the IO thread creates a SocketReceiveTask (at line 680+) after the main thread executed line 635, this SocketReceiveTask will access an invalid instance of UnixSocketImpl.

Is there a thread-safe way to stop the IO thread or make it stop watching the socket file descriptor?
Let's track this.  Please nom if we can get a reliable STR.
blocking-b2g: tef? → ---
tracking-b2g18: --- → ?
I can reproduce this issue by repeatedly connecting and disconnecting with my Nokia headset for no more than 40 times. The symptom is exactly the same as this issue.

I took a look at this problem, it's just like Thomas mentioned in comment 8, SocketReceiveTask is still possible to be created even after DeleteInstanceRunnable::run() is executed, so SocketReceiveTask will access an invalid instance of UnixSocketImpl.
Hi Kyle, could you please take a minute to see if this would work? Thanks.
Assignee: nobody → echou
Attachment #710624 - Flags: feedback?(kyle)
Hi Eric,

Is it save to stop the watch the file descriptor from within the main thread while the I/O thread is still listening? I was looking at the problem a few days ago when Kyle cc'd me, and I couldn't find a thread-safe way to do libevent calls. Also, I can still see an opportunity for this bug to happen if the I/O thread is executing OnFileCanReadWithoutBlocking while the main thread dispatches DeleteUnixSocketImpl.
I wrote this patch a few days ago to fix this bug. The idea is to signal the I/O thread that we're about to quit and let it shut down itself and dispatch the DeleteUnixSocketImpl instance The patch has a bug somewhere and doesn't really work, but idea is sound, if think.
Attachment #710629 - Flags: feedback?
(In reply to Thomas Zimmermann [:tzimmermann] from comment #12)
> Hi Eric,
> 
> Is it save to stop the watch the file descriptor from within the main thread
> while the I/O thread is still listening? I was looking at the problem a few
> days ago when Kyle cc'd me, and I couldn't find a thread-safe way to do
> libevent calls. 

Actually, I'm not pretty sure. But since we will always stop watching FDs only in CloseSocket() on main thread and in dtor of UnixSocketImpl, is it still not safe?

> Also, I can still see an opportunity for this bug to happen
> if the I/O thread is executing OnFileCanReadWithoutBlocking while the main
> thread dispatches DeleteUnixSocketImpl.

You're right, it could happen.
Ok, well yet another idea I'm not exactly loving, but...

So we /need/ to know that we're no longer watching the fds before deleting anything. So why not:

1. Make a function to run on the IO thread that will stop the watchers, that also takes the pointer to delete.
2. Make a function to run on the main thread that'll do the deletion.
3. When we run CloseSocket, queue the function in (1) on the IOThread.
4. When the IOThread gets to the task, it'll close out the watchers, then queue the deletion function to run on the main thread.
5. Main thread deletes pointer without worrying about racing.

This way, the order is implicit, doesn't require the complexity of the lock solution, and everything runs on the thread its expected to. Make sense? Am I missing anything?
Comment on attachment 710624 [details] [diff] [review]
patch 1: v1: Stop watching sockets before deleting UnixSocketImpl instance

Review of attachment 710624 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/unixsocket/UnixSocket.cpp
@@ +629,5 @@
>    mImpl = nullptr;
>    // Line it up to be destructed on the IO Thread
>    impl->mConsumer.forget();
>    impl->StopTask();
> +  impl->StopIO();

StopWatchingFileDescriptor ain't threadsafe. :(
Attachment #710624 - Flags: feedback?(kyle) → feedback-
Talked to cjones. StopWatchingFileDescriptor most likely isn't threadsafe, so I don't think that solution will work. I think my linearizing the functions across the threads should do it, but like I said, if I'm missing something, let me know.
(In reply to Kyle Machulis [:kmachulis] [:qdot] from comment #15)
> Ok, well yet another idea I'm not exactly loving, but...
> 
> So we /need/ to know that we're no longer watching the fds before deleting
> anything. So why not:
> 
> 1. Make a function to run on the IO thread that will stop the watchers, that
> also takes the pointer to delete.
> 2. Make a function to run on the main thread that'll do the deletion.
> 3. When we run CloseSocket, queue the function in (1) on the IOThread.
> 4. When the IOThread gets to the task, it'll close out the watchers, then
> queue the deletion function to run on the main thread.
> 5. Main thread deletes pointer without worrying about racing.
> 
> This way, the order is implicit, doesn't require the complexity of the lock
> solution, and everything runs on the thread its expected to. Make sense? Am
> I missing anything?

My patch already does something like that. It sets a flag that will make the I/O thread dispatch a DeleteUnixSocketImpl task to the main tread, which then does the real work. The locking is only for protecting the flag variable. The I/O thread needs to first wake up, though.

Another, much simpler idea is to simply close the socket file descriptor from within the main thread. I'm not sure what the I/O thread does in that case,  but my guess is that we can detect that the socket has been closed and stop watching. We'd need to signal this back to the main thread, which would then dispatch the DeleteUnixSignalImpl task.
(In reply to Thomas Zimmermann [:tzimmermann] from comment #18)
> My patch already does something like that. It sets a flag that will make the
> I/O thread dispatch a DeleteUnixSocketImpl task to the main tread, which
> then does the real work. The locking is only for protecting the flag
> variable. The I/O thread needs to first wake up, though.

Yeah, that's the nice part about having it as a task, we'll know it'll run versus waiting on something incoming to trigger it.
 
> Another, much simpler idea is to simply close the socket file descriptor
> from within the main thread. I'm not sure what the I/O thread does in that
> case,  but my guess is that we can detect that the socket has been closed
> and stop watching. We'd need to signal this back to the main thread, which
> would then dispatch the DeleteUnixSignalImpl task.

Hmm. Where would we watch for the descriptor getting closed? We don't really maintain control at all times since libevent is doing most of the work.
Hey Thomas, are you going to take this bug? :)
Flags: needinfo?(tzimmermann)
Hi Eric,

I can't reproduce it, but I'd give it another try and see how far I get.
Assignee: echou → tzimmermann
Flags: needinfo?(tzimmermann)
(In reply to Thomas Zimmermann [:tzimmermann] from comment #21)
> Hi Eric
> 
> I can't reproduce it, but I'd give it another try and see how far I get.

No problem. Since we've already known what the root cause is, I think it's fine to just upload your patch and we can discuss if it would work. Thanks in advance.
Attachment #712464 - Flags: review?(kyle)
Attachment #712464 - Flags: review?(echou)
Attachment #712466 - Flags: review?(kyle)
Attachment #712466 - Flags: review?(echou)
Attachment #710624 - Attachment is obsolete: true
Attachment #710629 - Attachment is obsolete: true
Attachment #710629 - Flags: feedback?
Attachment #712467 - Flags: review?(kyle)
Attachment #712467 - Flags: review?(echou)
Hey!

It took me several attempts to fix this bug, but I think these patches finally work.

 - The first patch simply adds some locking; nothing special.
 - The second patch converts SocketAcceptTask to use the I/O thread's watch mechanisms, instead of actively polling the listening socket. This is necessary because a SocketAcceptTask might execute after the UnixSocketImpl has already been deleted.
- Finally the third patch fixes the race condition by removing the socket's watchers from within the I/O thread and dispatching a delete operation.
Attachment #712464 - Flags: review?(kyle) → review+
Attachment #712466 - Flags: review?(kyle) → review+
Attachment #712467 - Flags: review?(kyle) → review+
Ok, everything looks good, seems to make sense.
Attachment #712464 - Flags: review?(echou) → review+
Attachment #712466 - Flags: review?(echou) → review+
Attachment #712467 - Flags: review?(echou) → review+
Looks good to me. Thanks, Thomas.
Rebased and added r=
Attachment #712464 - Attachment is obsolete: true
Attachment #712859 - Flags: review+
Rebased and added r=
Attachment #712466 - Attachment is obsolete: true
Attachment #712860 - Attachment is obsolete: true
Attachment #712861 - Flags: review+
Rebased and added r=
Attachment #712467 - Attachment is obsolete: true
Attachment #712862 - Flags: review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/89edfdd1a350
https://hg.mozilla.org/integration/mozilla-inbound/rev/4251e6dd0218
https://hg.mozilla.org/integration/mozilla-inbound/rev/164c9a8f3711

If this ends up getting approved for b2g18 uplift, please set checkin-needed again. My b2g uplift queries ignore the Gaia component and I won't see this otherwise.
Keywords: checkin-needed
Depends on: 841648
Comment on attachment 712859 [details] [diff] [review]
Protect mCurrentTaskIsCanceled by lock [r=qdot,echou]

Review of attachment 712859 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/unixsocket/UnixSocket.cpp
@@ +80,5 @@
>        return;
>      }
>      mTask->Cancel();
>      mTask = nullptr;
> +    MutexAutoLock lock(mLock);

This is not the correct way of using a lock. If we needed a lock (though the patch in bug 845148 shows that it's not needed here) then the lock would need to cover the entirety of the code that relies on having a consistent value for the protected variables. The lock also misses mTask possibly being assigned and read from multiple threads at the same time. I'm going to end up nuking it in my backport for bug 845148.
This is going to need to be rebased for b2g18 uplift on top of bug 845148.
Reopened for uplift.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Blake Kaplan (:mrbkap) from comment #35)
> Comment on attachment 712859 [details] [diff] [review]
> Protect mCurrentTaskIsCanceled by lock [r=qdot,echou]
> 
> Review of attachment 712859 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: ipc/unixsocket/UnixSocket.cpp
> @@ +80,5 @@
> >        return;
> >      }
> >      mTask->Cancel();
> >      mTask = nullptr;
> > +    MutexAutoLock lock(mLock);
> 
> This is not the correct way of using a lock. If we needed a lock (though the
> patch in bug 845148 shows that it's not needed here) then the lock would
> need to cover the entirety of the code that relies on having a consistent
> value for the protected variables. The lock also misses mTask possibly being
> assigned and read from multiple threads at the same time. I'm going to end
> up nuking it in my backport for bug 845148.

mlock only protects mCurrentTaskIsCanceled (and only mCurrentTaskisCanceled), which is signals to the IO thread (and only the IO thread) that the current task has been canceled from the main thread (and only the main thread). You need proper concurrency control for doing this.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #36)
> This is going to need to be rebased for b2g18 uplift on top of bug 845148.

Both patch sets overlap considerably. I cannot rebase the changes in these patches without risking serious regressions. Instead I'd propose to remove the changes for bug 845148 from b2g18, apply the fixes for this bug, and afterwards re-apply bug-845148 fixes.
Posted patch Backout fix 845148 (b2g18) (obsolete) — Splinter Review
Blake, could you re-port your patch set for bug 845148 to b2g18? That would bring both patch sets into the correct order, as bug 845148 is a fix for the patches here. Porting the socket changes is probably much easier now. You should be able to reuse the previous patch to BluetoothOppManager.
Attachment #722285 - Flags: review?(mrbkap)
Attachment #722288 - Attachment description: Wait for incoming connections in UnixSocketImpl [r=qdot,echou] → Wait for incoming connections in UnixSocketImpl [r=qdot,echou] (b2g18)
UNDO. UNDO. 

The "Wait for incoming connections in UnixSocketImpl" takes out the O_NONBLOCK set in SetNonblockingFlags, which not only causes us to lock on exit and completely negates the whole reason the function exists, god only knows what it's doing to the IOThread. Why is that gone? If you don't set the socket non-block, libevent isn't going to work at ALL.
(In reply to Thomas Zimmermann [:tzimmermann] from comment #38)
> mlock only protects mCurrentTaskIsCanceled (and only
> mCurrentTaskisCanceled), which is signals to the IO thread (and only the IO
> thread) that the current task has been canceled from the main thread (and
> only the main thread). You need proper concurrency control for doing this.

As the patch in bug 845148 shows, the lock is actually unnecessary. It's much better to avoid shared mutable state when possible. The patch in that bug does so by using the already thread-safe event loops in order to serialize access to the various data structures. Here is why this lock is not useful:

The reason that threads racing to set shared state is dangerous is to avoid inconsistent state when two states set the given value at the same time. Getting or setting a boolean is an atomic operation: it is either true or false, therefore only locking around the get or set operation does not guarantee anything that the underlying memory model doesn't already give you. The reason that the lock is not sufficient is this code:

    if (mTask) {
      return;
    }
    if (IsCanceled()) {
      return;
    }
    mTask = aTask;

needs to assume that IsCanceled() cannot change after the check. Otherwise, we can end up in a state where mIsCanceled is true and mTask is not in a canceled state. To make the lock work, it would be more correct to hold it above the mTask check (mTask also needs to be read under the same lock) and release it before posting the task to the event loop. Then StopTask would need to hold it for the entirety of its body to avoid racing against EnqueueTask.

Anyway, the approach in bug 845148 avoids all of this entirely by not having any shared mutable state. There cannot be any inconsistency because all accesses are properly serialized on all threads.
Comment on attachment 722285 [details] [diff] [review]
Backout fix 845148 (b2g18)

I'm not going to review this. The model on b2g18 is cleaner and better understood than what we'd be replacing it with. Bug 846615 has patches that sync b2g18 and m-c, which will make getting the rest of the bluetooth patches in easier everywhere.
Attachment #722285 - Flags: review?(mrbkap) → review-
Ok, nevermind on comment 44. accept() got merged back in accidentally.
(In reply to Blake Kaplan (:mrbkap) from comment #45)

The C++ memory model does not give you any guarantees about concurrent memory access. The compiler and processor are free to reorder operations, or split them up into smaller chunks, or not execute them at all. So even though a boolean is either true or false, you'd still need memory barriers to make sure that operations are processed in the correct order, or processed at all.

Anyway, thanks for fixing the undefined memory access and cleaning up this code. From comment 46, I assume that I don't need to back-port the patches from this issue, right?
(In reply to Thomas Zimmermann [:tzimmermann] from comment #48)
> Anyway, thanks for fixing the undefined memory access and cleaning up this
> code. From comment 46, I assume that I don't need to back-port the patches
> from this issue, right?

I think that's true for most of the patches. Once bug 846615 lands on mozilla-central, the models on the two branches will mostly match each other except for patch 2 here ("Wait for incoming connections..."), bug 844705 and some naming differences. What should probably happen is that we should probably merge the names and the second patch here (assuming that comment 44 means that qDot is okay with it) and probably overwrite my changes in the Opp manager with the patch for bug 844705. At that point, the two branches should be merged.

I'm going to re-mark this as FIXED since it's fixed on m-c and we can track the landing of the second patch with branch flags.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
If everyone's in agreement that b2g18 is where it needs to be, can you please set status-b2g18 to fixed on this, bug 842434, and bug 840544? Thanks.

And please post branch-specific patches of whatever still needs to land otherwise (and bonus points for setting checkin-needed on the relevant bugs as this has become a tangled mess to follow).
Talking to Kyle on IRC, it sounds like he agrees with my plan in comment 49. Tim, would you mind wrangling the relevant patches and marking the bugs appropriately?
Any updates here?
This is a port of the second patch to b2g18. The other fixes have already been ported as part of other patches.
Attachment #722285 - Attachment is obsolete: true
Attachment #722287 - Attachment is obsolete: true
Attachment #722288 - Attachment is obsolete: true
Attachment #722291 - Attachment is obsolete: true
Attachment #727773 - Flags: review?(kyle)
Attachment #727773 - Attachment is obsolete: true
Attachment #727773 - Flags: review?(kyle)
Attachment #727881 - Flags: review?(kyle)
Attachment #727881 - Flags: review?(echou)
Attachment #727881 - Flags: review?(kyle) → review+
Comment on attachment 727881 [details] [diff] [review]
Wait for incoming connections in UnixSocketImpl (b2g18)

Review of attachment 727881 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.
Attachment #727881 - Flags: review?(echou) → review+
Updated with review tags. Please uplift.
Attachment #727881 - Attachment is obsolete: true
Attachment #730611 - Flags: review+
This doesn't have approval-b2g18+ or blocking status. Please get approval and re-request checkin.
Keywords: checkin-needed
Target Milestone: --- → B2G C4 (2jan on)
You need to log in before you can comment on or make changes to this bug.