Closed
Bug 835283
Opened 11 years ago
Closed 11 years ago
Occasional crash in [@ nr_ice_socket_readable_cb][@ nr_socket_recvfrom]
Categories
(Core :: WebRTC, defect, P1)
Tracking
()
RESOLVED
DUPLICATE
of bug 830100
People
(Reporter: standard8, Assigned: ekr)
References
Details
(Keywords: crash, Whiteboard: [webrtc][blocking-webrtc+])
Crash Data
Attachments
(6 files, 1 obsolete file)
STR:
1) Have two computers hooked up to the social API demo
2) Initiate a call
3) Accept on the other end
4) As the call exchange is going through, close the call at the initiation end

Results:
Occasionally it crashes at the initiator with the stack reported. I'm guessing about 1 out of 30 times on my local setup.
(Note: this is similar to bug 821241, but it crashes at the initiator and the stack is different.)

Example stacks:
bp-14a11b81-619c-4e93-a6ce-6c36a2130128
bp-5c47c52a-34e2-4da3-8c9e-0f0c72130128
bp-ac171cbb-a71c-45af-b01b-ee5362130128

0  XUL  nr_socket_recvfrom
1  XUL  nr_ice_socket_readable_cb
2  XUL  mozilla::NrSocket::OnSocketReady
3  XUL  nsSocketTransportService::DoPollIteration
4  XUL  nsSocketTransportService::Run
5  XUL  _ZThn24_N24nsSocketTransportService3RunEv
6  XUL  nsThread::ProcessNextEvent
7  XUL  NS_ProcessNextEvent_P
8  XUL  nsThread::ThreadFunc
9  libnspr4.dylib  _pt_root
10 libsystem_c.dylib
11 libsystem_c.dylib
12 libnspr4.dylib
Comment 1•11 years ago
ekr - is this a known bug?
Assignee: nobody → ekr
Priority: -- → P1
Whiteboard: [WebRTC] → [webrtc][blocking-webrtc+]
Assignee
Comment 2•11 years ago
No, it's not a known bug and it's not obvious from looking at the code what the heck is going on. Suggest we land a few MOZ_CRASH tests to try to work out which null pointer we are about to dereference.
Assignee
Comment 3•11 years ago
Assignee
Comment 4•11 years ago
Can you try the test with the above patch and see what happens?
Flags: needinfo?(mbanner)
Comment 5•11 years ago
I can help here too, but can you kick off a try build with your patch? That will make investigating this go quicker.
Keywords: qawanted
Comment 7•11 years ago
Try builds are up - http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/rjesup@wgate.com-911126988d90/ Will investigate shortly...
Comment 8•11 years ago
Tested using a Win 7 machine and an OS X 10.7 machine, using the STR from comment 0 in a few variations. I didn't manage to get a crash.
Keywords: qawanted
Comment 9•11 years ago
(In reply to Jason Smith [:jsmith] from comment #8)
> Tested using a Win 7 machine and an OS X 10.7 machine, used the STR from
> comment 0 in a few variations. I didn't manage to get a crash.

Note - this testing was done using the try builds from comment 7, which include the attached patch.
Comment 10•11 years ago
Comparing against today's nightly (2/14), I tried testing this as well - I couldn't reproduce the crash with the STR in comment 0.
Reporter
Comment 11•11 years ago
I've just been reproducing this on current nightly, and on a debug build of nightly with the patch applied. It does seem harder to reproduce in a debug build. Unfortunately, I'm now getting a slightly different stack:

0  XUL  0x0000000109c4b4a8  nr_ice_socket_readable_cb + 88 (ice_socket.c:58)
1  XUL  0x0000000109c81e0b  mozilla::NrSocket::fire_callback(int) + 219 (nr_socket_prsock.cpp:191)
2  XUL  0x0000000109c81d03  mozilla::NrSocket::OnSocketReady(PRFileDesc*, short) + 67 (nr_socket_prsock.cpp:125)
3  XUL  0x0000000106083af6  nsSocketTransportService::DoPollIteration(bool) + 1158 (nsSocketTransportService2.cpp:786)
4  XUL  0x00000001060834c2  nsSocketTransportService::Run() + 274 (nsSocketTransportService2.cpp:645)
5  XUL  0x0000000106083efc  non-virtual thunk to nsSocketTransportService::Run() + 28
6  XUL  0x0000000108fd34d3  nsThread::ProcessNextEvent(bool, bool*) + 1619 (nsThread.cpp:628)
7  XUL  0x0000000108f340ff  NS_ProcessNextEvent_P(nsIThread*, bool) + 159 (nsThreadUtils.cpp:238)
8  XUL  0x0000000108fd1ef7  nsThread::ThreadFunc(void*) + 295 (nsThread.cpp:265)
9  libnspr4.dylib  0x0000000105948663  _pt_root + 355 (ptthread.c:159)
10 libsystem_c.dylib  0x00007fff8db6e742  _pthread_start + 327
11 libsystem_c.dylib  0x00007fff8db5b181  thread_start + 13
Flags: needinfo?(mbanner)
Reporter
Comment 12•11 years ago
(In reply to Mark Banner (:standard8) from comment #0)
> STR:
> 1) Have two computers hooked up to the social API demo
> 2) Initiate a call
> 3) Accept on the other end
> 4) As the call exchange is going through, close the call at the initiation end

I should expand on these. I've been using a Mac OS X 10.8 computer as the initiator and Windows 7 as the receiver, though I suspect platform doesn't matter too much. Getting the crash is all about timing - initiate the call, click accept at the receiver, and then the timing matters: you need to catch it after "Calling" disappears, but just before/as the video streams come up on the display. 1 in 15 to 1 in 30 is roughly the crash rate I'm getting.
Reporter
Comment 13•11 years ago
(In reply to Mark Banner (:standard8) from comment #11)
> Unfortunately, I'm now getting a slightly different stack:
>
> 0  XUL  0x0000000109c4b4a8  nr_ice_socket_readable_cb + 88 (ice_socket.c:58)
> 1  XUL  0x0000000109c81e0b  mozilla::NrSocket::fire_callback(int) + 219 (nr_socket_prsock.cpp:191)
> 2  XUL  0x0000000109c81d03  mozilla::NrSocket::OnSocketReady(PRFileDesc*, short) + 67 (nr_socket_prsock.cpp:125)

I've just been able to catch this stack in a debugger. We're crashing on this line:

58  r_log(LOG_ICE,LOG_DEBUG,"ICE(%s): Socket ready to read",sock->ctx->label);

and here's the information about sock:

(gdb) print sock
$1 = (nr_ice_socket *) 0x11510316c
(gdb) print *sock
$2 = {
  sock = 0x5a5a5a5a5a5a5a5a,
  ctx = 0x5a5a5a5a5a5a5a5a,
  candidates = {
    tqh_first = 0x5a5a5a5a5a5a5a5a,
    tqh_last = 0x5a5a5a5a5a5a5a5a
  },
  component = 0x5a5a5a5a5a5a5a5a,
  stun_ctxs = {
    tqh_first = 0x5a5a5a5a5a5a5a5a,
    tqh_last = 0x5a5a5a5a5a5a5a5a
  },
  stun_server = 0x5a5a5a5a5a5a5a5a,
  stun_server_handle = 0x5a5a5a5a5a5a5a5a,
  entry = {
    stqe_next = 0x5a5a5a5a5a5a5a5a
  }
}

which IIRC is deleted memory?
Reporter
Comment 14•11 years ago
More information from the crash, specifically 'bt full' and 'info threads'. I have this held in a debugger and should be able to keep it there for the next 2-3 hours.
Assignee
Comment 15•11 years ago
Assignee
Comment 16•11 years ago
Mark: the attached patch tests that the destroy is called on the right thread. Can you please try to repro this error with the patch applied?
Assignee
Updated•11 years ago
Attachment #714710 - Flags: review?(adam)
Comment 17•11 years ago
Note that I wonder if this is actually the same crash... but it is certainly a crash provoked by the same actions.
Comment 18•11 years ago
That crash report was another bug that can be hit with the same sequence. Marking this bug as non-blocking for preffing on the feature; it's a low-frequency crash requiring just the right timing. I'll continue to try to repro it with ekr's asserts added.
No longer blocks: 796463
Updated•11 years ago
Attachment #714710 - Flags: review?(adam) → review+
Comment 19•11 years ago
The logs above are for bug 842171
Comment 20•11 years ago
Landed the additional asserts https://hg.mozilla.org/integration/mozilla-inbound/rev/8febb0aa205f
Whiteboard: [webrtc][blocking-webrtc+] → [webrtc][blocking-webrtc+][leave-open]
Comment 21•11 years ago
Not sure this helps; the stack appears royally trashed.
Attachment #714950 - Attachment is obsolete: true
Comment 22•11 years ago
Finally, a clean wrong-thread release of an ICE socket. I don't see how NR_ASYNC_SCHEDULE(nr_ice_ctx_destroy_cb,*ctxp); (ice_ctx.c:376) can ever be correct unless it's *certain* it won't cause socket refcounts to go to 0 and get released.
Comment 24•11 years ago
This shows who called nr_ice_ctx_destroy() on a thread other than STS (I added a MOZ_ASSERT(!NS_IsMainThread()) to nr_async_schedule() for just the one call that destroys the ICE context - a quick hack). It shows that the source is the same bug we found and reported as a different problem over the weekend - Vcm_RxStartICE_m() failing and releasing the RTP flow it just created. It looks like we need to either fix that use-case in Vcm_RxStartICE_m() (and add a strong assert that flows are destroyed on the STS thread), or fix it so TransportFlows can deal with being released off STS (by proxying their contents over to STS).
Comment 26•11 years ago
This (the timer issue) is a different aspect of the same bug as bug 842171 (now duped back to here). The discussion from that bug as to the cause: when calling VcmRxStartICE_m(), vcmCreateTransportFlow() can error out due to mMedia becoming NULL, causing wrong-thread releases in VcmRxStartICE_m(). This is likely due to the patch for bug 820538, and the fact that we're still using DISPATCH_SYNC. Since pc->mMedia became NULL, we can't store the flow in it, so we release it on main thread, which due to bug 830100 can cause a thread-safety assertion. (Note that this crashes on purpose even in an opt/release build, which limits any security implications.) Releasing the TransportFlow on MainThread can cause NSS to send on main thread (bug 842171) or a timer to be spawned on MainThread to destroy the ICE context (this bug).
Assignee
Comment 27•11 years ago
(In reply to Randell Jesup [:jesup] from comment #22)
> Created attachment 715249 [details]
> clean log of wrong-thread socket release
>
> Finally, a clean wrong-thread release of an ICE socket.
>
> I don't see how NR_ASYNC_SCHEDULE(nr_ice_ctx_destroy_cb,*ctxp);
> (ice_ctx.c:376) can ever be correct unless it's *certain* it won't cause
> socket refcounts to go to 0 and get released.

Huh? That's in fact the intended behavior. It just needs to happen on the right thread.
Comment 28•11 years ago
Bug 830100 has a patch awaiting final review, so that will make this bug moot when it lands.
Depends on: 830100
Comment 29•11 years ago
On second thought, it's clear this will be fixed by bug 830100, so I'll just dup it now.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
Whiteboard: [webrtc][blocking-webrtc+][leave-open] → [webrtc][blocking-webrtc+]