Closed
Bug 836523
Opened 12 years ago
Closed 12 years ago
crash in mozilla::ipc::SocketReceiveTask::Run shutting off Bluetooth file transfer in process
Categories
(Firefox OS Graveyard :: Gaia::Bluetooth, defect)
Tracking
(firefox20 wontfix, firefox21 fixed, b2g18+ affected, b2g18-v1.0.0 wontfix, b2g18-v1.0.1 wontfix)
RESOLVED
FIXED
B2G C4 (2jan on)
People
(Reporter: marcia, Assigned: tzimmermann)
References
Details
(Keywords: crash, Whiteboard: [b2g-crash])
Crash Data
Attachments
(4 files, 12 obsolete files)
2.52 KB,
patch
|
tzimmermann
:
review+
|
Details | Diff | Splinter Review |
8.34 KB,
patch
|
tzimmermann
:
review+
|
Details | Diff | Splinter Review |
5.04 KB,
patch
|
tzimmermann
:
review+
|
Details | Diff | Splinter Review |
8.23 KB,
patch
|
tzimmermann
:
review+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is
report bp-b0854aeb-0d9f-4030-9b29-2836f2130130 .
=============================================================
unagi, seen while running:
Gecko revision="066b9d7cf30884a001db22bde3ae939c02718062"
Gaia revision=f7f5a0cd17e3d04308cc5850b254947e127122b9
STR:
1. Start a bluetooth file transfer - I was transferring a 25.3 MB music file
2. Stop the transfer by turning off bluetooth on the unagi device.
3. Receive a crash.
Reporter | ||
Comment 1•12 years ago
|
||
Frame Module Signature Source
0 libxul.so mozilla::ipc::SocketReceiveTask::Run UnixSocket.cpp:333
1 libxul.so nsThread::ProcessNextEvent nsThread.cpp:620
2 libxul.so NS_ProcessNextEvent_P nsThreadUtils.cpp:237
3 libxul.so mozilla::ipc::MessagePump::Run MessagePump.cpp:82
4 libxul.so MessageLoop::RunInternal message_loop.cc:215
5 libxul.so MessageLoop::Run message_loop.cc:208
6 libxul.so nsBaseAppShell::Run nsBaseAppShell.cpp:163
7 libxul.so nsAppStartup::Run nsAppStartup.cpp:290
8 libxul.so XREMain::XRE_mainRun nsAppRunner.cpp:3794
9 libxul.so XREMain::XRE_main nsAppRunner.cpp:3860
10 libxul.so XRE_main nsAppRunner.cpp:3935
11 b2g main nsBrowserApp.cpp:164
12 libc.so __libc_init libc_init_dynamic.c:114
13 libc.so __cxa_atexit atexit.c:99
14 @0xbea62d45
Updated•12 years ago
|
blocking-b2g: --- → tef?
OS: Android → Gonk (Firefox OS)
Hardware: All → ARM
Whiteboard: [b2g-crash]
Reporter | ||
Comment 2•12 years ago
|
||
Using the QC RIL when this crash occurred.
Comment 3•12 years ago
|
||
can we get someone to confirm this reproduces against moz RIL? if not, then we can send this onward to QC.
Comment 4•12 years ago
|
||
hmm..i don't see us in the .so list
Comment 5•12 years ago
|
||
I'm using today's Unagi engineering build (w/Marionette enabled).
Sorry, but I understand time is of the essence, here, so I don't have a crash ID right now.
My git commit info is: 2013-01-30 09:18:04
f7fa0cd17e3d04308cc5850b254947e...
Build identifier is: 20130130063101
Reporter | ||
Comment 6•12 years ago
|
||
I tried reproducing it on a few other devices but haven't had any luck hitting it again.
Comment 7•12 years ago
|
||
This looks a lot like bug 830704. Adding tzimmerman to cc just in case this does turn out to be repro'able.
Assignee | ||
Comment 8•12 years ago
|
||
I cannot reproduce the problem, but from looking at the code, I have an idea what might happen.
The fix for bug 830704 ensures that the instance of UnixSocketImpl gets deleted after the last pending SocketReceiveTask. I assumed that in ipc/unixsocket/UnixSocket.cpp, line 626, the IO thread would have been stopped. That seems not to be the case. If the IO thread creates a SocketReceiveTask (at line 680+) after the main thread executed line 635, this SocketReceiveTask will access an invalid instance of UnixSocketImpl.
Is there a thread-safe way to stop the IO thread or make it stop watching the socket file descriptor?
Comment 9•12 years ago
|
||
Let's track this. Please nom if we can get a reliable STR.
blocking-b2g: tef? → ---
tracking-b2g18:
--- → ?
Comment 10•12 years ago
|
||
I can reproduce this issue by repeatedly connecting and disconnecting with my Nokia headset for no more than 40 times. The symptom is exactly the same as this issue.
I took a look at this problem, it's just like Thomas mentioned in comment 8, SocketReceiveTask is still possible to be created even after DeleteInstanceRunnable::run() is executed, so SocketReceiveTask will access an invalid instance of UnixSocketImpl.
Comment 11•12 years ago
|
||
Hi Kyle, could you please take a minute to see if this would work? Thanks.
Assignee: nobody → echou
Attachment #710624 -
Flags: feedback?(kyle)
Assignee | ||
Comment 12•12 years ago
|
||
Hi Eric,
Is it save to stop the watch the file descriptor from within the main thread while the I/O thread is still listening? I was looking at the problem a few days ago when Kyle cc'd me, and I couldn't find a thread-safe way to do libevent calls. Also, I can still see an opportunity for this bug to happen if the I/O thread is executing OnFileCanReadWithoutBlocking while the main thread dispatches DeleteUnixSocketImpl.
Assignee | ||
Comment 13•12 years ago
|
||
I wrote this patch a few days ago to fix this bug. The idea is to signal the I/O thread that we're about to quit and let it shut down itself and dispatch the DeleteUnixSocketImpl instance The patch has a bug somewhere and doesn't really work, but idea is sound, if think.
Attachment #710629 -
Flags: feedback?
Comment 14•12 years ago
|
||
(In reply to Thomas Zimmermann [:tzimmermann] from comment #12)
> Hi Eric,
>
> Is it save to stop the watch the file descriptor from within the main thread
> while the I/O thread is still listening? I was looking at the problem a few
> days ago when Kyle cc'd me, and I couldn't find a thread-safe way to do
> libevent calls.
Actually, I'm not pretty sure. But since we will always stop watching FDs only in CloseSocket() on main thread and in dtor of UnixSocketImpl, is it still not safe?
> Also, I can still see an opportunity for this bug to happen
> if the I/O thread is executing OnFileCanReadWithoutBlocking while the main
> thread dispatches DeleteUnixSocketImpl.
You're right, it could happen.
Comment 15•12 years ago
|
||
Ok, well yet another idea I'm not exactly loving, but...
So we /need/ to know that we're no longer watching the fds before deleting anything. So why not:
1. Make a function to run on the IO thread that will stop the watchers, that also takes the pointer to delete.
2. Make a function to run on the main thread that'll do the deletion.
3. When we run CloseSocket, queue the function in (1) on the IOThread.
4. When the IOThread gets to the task, it'll close out the watchers, then queue the deletion function to run on the main thread.
5. Main thread deletes pointer without worrying about racing.
This way, the order is implicit, doesn't require the complexity of the lock solution, and everything runs on the thread its expected to. Make sense? Am I missing anything?
Comment 16•12 years ago
|
||
Comment on attachment 710624 [details] [diff] [review]
patch 1: v1: Stop watching sockets before deleting UnixSocketImpl instance
Review of attachment 710624 [details] [diff] [review]:
-----------------------------------------------------------------
::: ipc/unixsocket/UnixSocket.cpp
@@ +629,5 @@
> mImpl = nullptr;
> // Line it up to be destructed on the IO Thread
> impl->mConsumer.forget();
> impl->StopTask();
> + impl->StopIO();
StopWatchingFileDescriptor ain't threadsafe. :(
Attachment #710624 -
Flags: feedback?(kyle) → feedback-
Comment 17•12 years ago
|
||
Talked to cjones. StopWatchingFileDescriptor most likely isn't threadsafe, so I don't think that solution will work. I think my linearizing the functions across the threads should do it, but like I said, if I'm missing something, let me know.
Assignee | ||
Comment 18•12 years ago
|
||
(In reply to Kyle Machulis [:kmachulis] [:qdot] from comment #15)
> Ok, well yet another idea I'm not exactly loving, but...
>
> So we /need/ to know that we're no longer watching the fds before deleting
> anything. So why not:
>
> 1. Make a function to run on the IO thread that will stop the watchers, that
> also takes the pointer to delete.
> 2. Make a function to run on the main thread that'll do the deletion.
> 3. When we run CloseSocket, queue the function in (1) on the IOThread.
> 4. When the IOThread gets to the task, it'll close out the watchers, then
> queue the deletion function to run on the main thread.
> 5. Main thread deletes pointer without worrying about racing.
>
> This way, the order is implicit, doesn't require the complexity of the lock
> solution, and everything runs on the thread its expected to. Make sense? Am
> I missing anything?
My patch already does something like that. It sets a flag that will make the I/O thread dispatch a DeleteUnixSocketImpl task to the main tread, which then does the real work. The locking is only for protecting the flag variable. The I/O thread needs to first wake up, though.
Another, much simpler idea is to simply close the socket file descriptor from within the main thread. I'm not sure what the I/O thread does in that case, but my guess is that we can detect that the socket has been closed and stop watching. We'd need to signal this back to the main thread, which would then dispatch the DeleteUnixSignalImpl task.
Comment 19•12 years ago
|
||
(In reply to Thomas Zimmermann [:tzimmermann] from comment #18)
> My patch already does something like that. It sets a flag that will make the
> I/O thread dispatch a DeleteUnixSocketImpl task to the main tread, which
> then does the real work. The locking is only for protecting the flag
> variable. The I/O thread needs to first wake up, though.
Yeah, that's the nice part about having it as a task, we'll know it'll run versus waiting on something incoming to trigger it.
> Another, much simpler idea is to simply close the socket file descriptor
> from within the main thread. I'm not sure what the I/O thread does in that
> case, but my guess is that we can detect that the socket has been closed
> and stop watching. We'd need to signal this back to the main thread, which
> would then dispatch the DeleteUnixSignalImpl task.
Hmm. Where would we watch for the descriptor getting closed? We don't really maintain control at all times since libevent is doing most of the work.
Updated•12 years ago
|
Comment 20•12 years ago
|
||
Hey Thomas, are you going to take this bug? :)
Flags: needinfo?(tzimmermann)
Assignee | ||
Comment 21•12 years ago
|
||
Hi Eric,
I can't reproduce it, but I'd give it another try and see how far I get.
Assignee: echou → tzimmermann
Flags: needinfo?(tzimmermann)
Comment 22•12 years ago
|
||
(In reply to Thomas Zimmermann [:tzimmermann] from comment #21)
> Hi Eric
>
> I can't reproduce it, but I'd give it another try and see how far I get.
No problem. Since we've already known what the root cause is, I think it's fine to just upload your patch and we can discuss if it would work. Thanks in advance.
Assignee | ||
Comment 23•12 years ago
|
||
Attachment #712464 -
Flags: review?(kyle)
Attachment #712464 -
Flags: review?(echou)
Assignee | ||
Comment 24•12 years ago
|
||
Attachment #712466 -
Flags: review?(kyle)
Attachment #712466 -
Flags: review?(echou)
Assignee | ||
Comment 25•12 years ago
|
||
Attachment #710624 -
Attachment is obsolete: true
Attachment #710629 -
Attachment is obsolete: true
Attachment #710629 -
Flags: feedback?
Attachment #712467 -
Flags: review?(kyle)
Attachment #712467 -
Flags: review?(echou)
Assignee | ||
Comment 26•12 years ago
|
||
Hey!
It took me several attempts to fix this bug, but I think these patches finally work.
- The first patch simply adds some locking; nothing special.
- The second patch converts SocketAcceptTask to use the I/O thread's watch mechanisms, instead of actively polling the listening socket. This is necessary because a SocketAcceptTask might execute after the UnixSocketImpl has already been deleted.
- Finally the third patch fixes the race condition by removing the socket's watchers from within the I/O thread and dispatching a delete operation.
Updated•12 years ago
|
Attachment #712464 -
Flags: review?(kyle) → review+
Updated•12 years ago
|
Attachment #712466 -
Flags: review?(kyle) → review+
Updated•12 years ago
|
Attachment #712467 -
Flags: review?(kyle) → review+
Comment 27•12 years ago
|
||
Ok, everything looks good, seems to make sense.
Updated•12 years ago
|
Attachment #712464 -
Flags: review?(echou) → review+
Updated•12 years ago
|
Attachment #712466 -
Flags: review?(echou) → review+
Updated•12 years ago
|
Attachment #712467 -
Flags: review?(echou) → review+
Comment 28•12 years ago
|
||
Looks good to me. Thanks, Thomas.
Assignee | ||
Comment 29•12 years ago
|
||
Rebased and added r=
Attachment #712464 -
Attachment is obsolete: true
Attachment #712859 -
Flags: review+
Assignee | ||
Comment 31•12 years ago
|
||
Rebased and added r=
Attachment #712466 -
Attachment is obsolete: true
Attachment #712860 -
Attachment is obsolete: true
Attachment #712861 -
Flags: review+
Assignee | ||
Comment 32•12 years ago
|
||
Rebased and added r=
Attachment #712467 -
Attachment is obsolete: true
Attachment #712862 -
Flags: review+
Assignee | ||
Updated•12 years ago
|
Keywords: checkin-needed
Comment 33•12 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/89edfdd1a350
https://hg.mozilla.org/integration/mozilla-inbound/rev/4251e6dd0218
https://hg.mozilla.org/integration/mozilla-inbound/rev/164c9a8f3711
If this ends up getting approved for b2g18 uplift, please set checkin-needed again. My b2g uplift queries ignore the Gaia component and I won't see this otherwise.
Keywords: checkin-needed
Comment 34•12 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/89edfdd1a350
https://hg.mozilla.org/mozilla-central/rev/4251e6dd0218
https://hg.mozilla.org/mozilla-central/rev/164c9a8f3711
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 35•12 years ago
|
||
Comment on attachment 712859 [details] [diff] [review]
Protect mCurrentTaskIsCanceled by lock [r=qdot,echou]
Review of attachment 712859 [details] [diff] [review]:
-----------------------------------------------------------------
::: ipc/unixsocket/UnixSocket.cpp
@@ +80,5 @@
> return;
> }
> mTask->Cancel();
> mTask = nullptr;
> + MutexAutoLock lock(mLock);
This is not the correct way of using a lock. If we needed a lock (though the patch in bug 845148 shows that it's not needed here) then the lock would need to cover the entirety of the code that relies on having a consistent value for the protected variables. The lock also misses mTask possibly being assigned and read from multiple threads at the same time. I'm going to end up nuking it in my backport for bug 845148.
Comment 36•12 years ago
|
||
This is going to need to be rebased for b2g18 uplift on top of bug 845148.
Assignee | ||
Comment 37•12 years ago
|
||
Reopened for uplift.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 38•12 years ago
|
||
(In reply to Blake Kaplan (:mrbkap) from comment #35)
> Comment on attachment 712859 [details] [diff] [review]
> Protect mCurrentTaskIsCanceled by lock [r=qdot,echou]
>
> Review of attachment 712859 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: ipc/unixsocket/UnixSocket.cpp
> @@ +80,5 @@
> > return;
> > }
> > mTask->Cancel();
> > mTask = nullptr;
> > + MutexAutoLock lock(mLock);
>
> This is not the correct way of using a lock. If we needed a lock (though the
> patch in bug 845148 shows that it's not needed here) then the lock would
> need to cover the entirety of the code that relies on having a consistent
> value for the protected variables. The lock also misses mTask possibly being
> assigned and read from multiple threads at the same time. I'm going to end
> up nuking it in my backport for bug 845148.
mlock only protects mCurrentTaskIsCanceled (and only mCurrentTaskisCanceled), which is signals to the IO thread (and only the IO thread) that the current task has been canceled from the main thread (and only the main thread). You need proper concurrency control for doing this.
Assignee | ||
Comment 39•12 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #36)
> This is going to need to be rebased for b2g18 uplift on top of bug 845148.
Both patch sets overlap considerably. I cannot rebase the changes in these patches without risking serious regressions. Instead I'd propose to remove the changes for bug 845148 from b2g18, apply the fixes for this bug, and afterwards re-apply bug-845148 fixes.
Assignee | ||
Comment 40•12 years ago
|
||
Blake, could you re-port your patch set for bug 845148 to b2g18? That would bring both patch sets into the correct order, as bug 845148 is a fix for the patches here. Porting the socket changes is probably much easier now. You should be able to reuse the previous patch to BluetoothOppManager.
Attachment #722285 -
Flags: review?(mrbkap)
Assignee | ||
Comment 41•12 years ago
|
||
Attachment #722287 -
Flags: review+
Assignee | ||
Comment 42•12 years ago
|
||
Attachment #722288 -
Flags: review+
Assignee | ||
Comment 43•12 years ago
|
||
Attachment #722291 -
Flags: review+
Assignee | ||
Updated•12 years ago
|
Attachment #722288 -
Attachment description: Wait for incoming connections in UnixSocketImpl [r=qdot,echou] → Wait for incoming connections in UnixSocketImpl [r=qdot,echou] (b2g18)
Comment 44•12 years ago
|
||
UNDO. UNDO.
The "Wait for incoming connections in UnixSocketImpl" takes out the O_NONBLOCK set in SetNonblockingFlags, which not only causes us to lock on exit and completely negates the whole reason the function exists, god only knows what it's doing to the IOThread. Why is that gone? If you don't set the socket non-block, libevent isn't going to work at ALL.
Comment 45•12 years ago
|
||
(In reply to Thomas Zimmermann [:tzimmermann] from comment #38)
> mlock only protects mCurrentTaskIsCanceled (and only
> mCurrentTaskisCanceled), which is signals to the IO thread (and only the IO
> thread) that the current task has been canceled from the main thread (and
> only the main thread). You need proper concurrency control for doing this.
As the patch in bug 845148 shows, the lock is actually unnecessary. It's much better to avoid shared mutable state when possible. The patch in that bug does so by using the already thread-safe event loops in order to serialize access to the various data structures. Here is why this lock is not useful:
The reason that threads racing to set shared state is dangerous is to avoid inconsistent state when two states set the given value at the same time. Getting or setting a boolean is an atomic operation: it is either true or false, therefore only locking around the get or set operation does not guarantee anything that the underlying memory model doesn't already give you. The reason that the lock is not sufficient is this code:
if (mTask) {
return;
}
if (IsCanceled()) {
return;
}
mTask = aTask;
needs to assume that IsCanceled() cannot change after the check. Otherwise, we can end up in a state where mIsCanceled is true and mTask is not in a canceled state. To make the lock work, it would be more correct to hold it above the mTask check (mTask also needs to be read under the same lock) and release it before posting the task to the event loop. Then StopTask would need to hold it for the entirety of its body to avoid racing against EnqueueTask.
Anyway, the approach in bug 845148 avoids all of this entirely by not having any shared mutable state. There cannot be any inconsistency because all accesses are properly serialized on all threads.
Comment 46•12 years ago
|
||
Comment on attachment 722285 [details] [diff] [review]
Backout fix 845148 (b2g18)
I'm not going to review this. The model on b2g18 is cleaner and better understood than what we'd be replacing it with. Bug 846615 has patches that sync b2g18 and m-c, which will make getting the rest of the bluetooth patches in easier everywhere.
Attachment #722285 -
Flags: review?(mrbkap) → review-
Comment 47•12 years ago
|
||
Ok, nevermind on comment 44. accept() got merged back in accidentally.
Assignee | ||
Comment 48•12 years ago
|
||
(In reply to Blake Kaplan (:mrbkap) from comment #45)
The C++ memory model does not give you any guarantees about concurrent memory access. The compiler and processor are free to reorder operations, or split them up into smaller chunks, or not execute them at all. So even though a boolean is either true or false, you'd still need memory barriers to make sure that operations are processed in the correct order, or processed at all.
Anyway, thanks for fixing the undefined memory access and cleaning up this code. From comment 46, I assume that I don't need to back-port the patches from this issue, right?
Comment 49•12 years ago
|
||
(In reply to Thomas Zimmermann [:tzimmermann] from comment #48)
> Anyway, thanks for fixing the undefined memory access and cleaning up this
> code. From comment 46, I assume that I don't need to back-port the patches
> from this issue, right?
I think that's true for most of the patches. Once bug 846615 lands on mozilla-central, the models on the two branches will mostly match each other except for patch 2 here ("Wait for incoming connections..."), bug 844705 and some naming differences. What should probably happen is that we should probably merge the names and the second patch here (assuming that comment 44 means that qDot is okay with it) and probably overwrite my changes in the Opp manager with the patch for bug 844705. At that point, the two branches should be merged.
I'm going to re-mark this as FIXED since it's fixed on m-c and we can track the landing of the second patch with branch flags.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 50•12 years ago
|
||
If everyone's in agreement that b2g18 is where it needs to be, can you please set status-b2g18 to fixed on this, bug 842434, and bug 840544? Thanks.
And please post branch-specific patches of whatever still needs to land otherwise (and bonus points for setting checkin-needed on the relevant bugs as this has become a tangled mess to follow).
Comment 51•12 years ago
|
||
Talking to Kyle on IRC, it sounds like he agrees with my plan in comment 49. Tim, would you mind wrangling the relevant patches and marking the bugs appropriately?
Comment 52•12 years ago
|
||
Any updates here?
Assignee | ||
Comment 53•12 years ago
|
||
This is a port of the second patch to b2g18. The other fixes have already been ported as part of other patches.
Attachment #722285 -
Attachment is obsolete: true
Attachment #722287 -
Attachment is obsolete: true
Attachment #722288 -
Attachment is obsolete: true
Attachment #722291 -
Attachment is obsolete: true
Attachment #727773 -
Flags: review?(kyle)
Assignee | ||
Comment 54•12 years ago
|
||
Attachment #727773 -
Attachment is obsolete: true
Attachment #727773 -
Flags: review?(kyle)
Attachment #727881 -
Flags: review?(kyle)
Attachment #727881 -
Flags: review?(echou)
Updated•12 years ago
|
Attachment #727881 -
Flags: review?(kyle) → review+
Comment 55•12 years ago
|
||
Comment on attachment 727881 [details] [diff] [review]
Wait for incoming connections in UnixSocketImpl (b2g18)
Review of attachment 727881 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good.
Attachment #727881 -
Flags: review?(echou) → review+
Assignee | ||
Comment 56•12 years ago
|
||
Updated with review tags. Please uplift.
Attachment #727881 -
Attachment is obsolete: true
Attachment #730611 -
Flags: review+
Assignee | ||
Updated•12 years ago
|
Keywords: checkin-needed
Comment 57•12 years ago
|
||
This doesn't have approval-b2g18+ or blocking status. Please get approval and re-request checkin.
status-b2g18:
--- → affected
status-b2g18-v1.0.0:
--- → wontfix
status-b2g18-v1.0.1:
--- → wontfix
status-firefox20:
--- → wontfix
status-firefox21:
--- → fixed
Keywords: checkin-needed
Target Milestone: --- → B2G C4 (2jan on)
You need to log in
before you can comment on or make changes to this bug.
Description
•