Bug 835283 (closed): opened 11 years ago, closed 11 years ago

Occasional crash in [@ nr_ice_socket_readable_cb][@ nr_socket_recvfrom]

Categories: Core :: WebRTC, defect, P1
Platform: All / macOS
Status: RESOLVED DUPLICATE of bug 830100
People: Reporter: standard8; Assigned: ekr
Keywords: crash
Whiteboard: [webrtc][blocking-webrtc+]
Attachments: 6 files, 1 obsolete file

STR:

1) Have two computers hooked up to the social API demo
2) Initiate a call
3) Accept on the other end
4) As the call exchange is going through, close the call at the initiation end

Results:

Occasionally it crashes at the initiator with the stack reported below; I'd guess about 1 in 30 times on my local setup.

(Note: this is similar to bug 821241, but it crashes at the initiator and the stack is different.)

Example stacks:

bp-14a11b81-619c-4e93-a6ce-6c36a2130128
bp-5c47c52a-34e2-4da3-8c9e-0f0c72130128
bp-ac171cbb-a71c-45af-b01b-ee5362130128

0 	XUL 	nr_socket_recvfrom
1 	XUL 	nr_ice_socket_readable_cb
2 	XUL 	mozilla::NrSocket::OnSocketReady
3 	XUL 	nsSocketTransportService::DoPollIteration
4 	XUL 	nsSocketTransportService::Run
5 	XUL 	_ZThn24_N24nsSocketTransportService3RunEv
6 	XUL 	nsThread::ProcessNextEvent
7 	XUL 	NS_ProcessNextEvent_P
8 	XUL 	nsThread::ThreadFunc
9 	libnspr4.dylib 	_pt_root
10 	libsystem_c.dylib
11 	libsystem_c.dylib
12 	libnspr4.dylib
ekr - is this a known bug?
Assignee: nobody → ekr
Priority: -- → P1
Whiteboard: [WebRTC] → [webrtc][blocking-webrtc+]
Blocks: 796463
No, it's not a known bug and it's not obvious from looking at the code what the heck is going on.

Suggest we land a few MOZ_CRASH tests to try to work out which null pointer we are about to dereference.
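
For concreteness, here is a minimal sketch of what such MOZ_CRASH probes could look like at the top of the callback. This is illustrative only, not the landed patch; the messages and placement are assumptions, and the surrounding nICEr headers are assumed in-tree, though the callback follows the standard nICEr async-callback shape:

/* Hypothetical diagnostic probes (not the landed patch): crash with a
 * distinct message per pointer so the crash-stats signature tells us
 * exactly which dereference was about to fail. Assumes the nICEr headers
 * that define NR_SOCKET and nr_ice_socket are in scope. */
#include "mozilla/Assertions.h"

static void nr_ice_socket_readable_cb(NR_SOCKET s, int how, void *cb_arg)
  {
    nr_ice_socket *sock = (nr_ice_socket *)cb_arg;

    if (!sock)
      MOZ_CRASH("nr_ice_socket_readable_cb: null sock");
    if (!sock->ctx)
      MOZ_CRASH("nr_ice_socket_readable_cb: null sock->ctx");
    if (!sock->sock)
      MOZ_CRASH("nr_ice_socket_readable_cb: null sock->sock");

    /* ... existing read/dispatch logic would continue here ... */
  }

Each probe crashes with its own message, so one round of crash reports would distinguish a null sock from a null sock->ctx without needing a debugger.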
Can you try the test with the above patch and see what happens?
Flags: needinfo?(mbanner)
I can help here too, but can you kick off a try build with your patch? That will make investigating this go quicker.
Keywords: qawanted
Tested using a Win 7 machine and an OS X 10.7 machine, following the STR from comment 0 in a few variations. I didn't manage to get a crash.
Keywords: qawanted
(In reply to Jason Smith [:jsmith] from comment #8)
> Tested using a Win 7 machine and a OS X 10.7 machine, used the STR from
> comment 0 in a few variations. I didn't manage to get a crash.

Note: this testing used the try builds from comment 7, which include the attached patch.
Comparing against today's (2/14) nightly, I tried testing this as well; I couldn't reproduce the crash with the STR in comment 0.
I've just been reproducing this on current nightly, and on a debug build of nightly with the patch applied. It does seem harder to reproduce in a debug build.

Unfortunately, I'm now getting a slightly different stack:

0   XUL                           	0x0000000109c4b4a8 nr_ice_socket_readable_cb + 88 (ice_socket.c:58)
1   XUL                           	0x0000000109c81e0b mozilla::NrSocket::fire_callback(int) + 219 (nr_socket_prsock.cpp:191)
2   XUL                           	0x0000000109c81d03 mozilla::NrSocket::OnSocketReady(PRFileDesc*, short) + 67 (nr_socket_prsock.cpp:125)
3   XUL                           	0x0000000106083af6 nsSocketTransportService::DoPollIteration(bool) + 1158 (nsSocketTransportService2.cpp:786)
4   XUL                           	0x00000001060834c2 nsSocketTransportService::Run() + 274 (nsSocketTransportService2.cpp:645)
5   XUL                           	0x0000000106083efc non-virtual thunk to nsSocketTransportService::Run() + 28
6   XUL                           	0x0000000108fd34d3 nsThread::ProcessNextEvent(bool, bool*) + 1619 (nsThread.cpp:628)
7   XUL                           	0x0000000108f340ff NS_ProcessNextEvent_P(nsIThread*, bool) + 159 (nsThreadUtils.cpp:238)
8   XUL                           	0x0000000108fd1ef7 nsThread::ThreadFunc(void*) + 295 (nsThread.cpp:265)
9   libnspr4.dylib                	0x0000000105948663 _pt_root + 355 (ptthread.c:159)
10  libsystem_c.dylib             	0x00007fff8db6e742 _pthread_start + 327
11  libsystem_c.dylib             	0x00007fff8db5b181 thread_start + 13
Flags: needinfo?(mbanner)
(In reply to Mark Banner (:standard8) from comment #0)
> STR:
> 
> 1) Have two computers hooked up to the social API demo
> 2) Initiate a call
> 3) Accept on the other end
> 4) As the call exchange is going through, close the call at the initiation
> end

I should expand on these. I've been using a Mac OS X 10.8 computer as the initiator and Windows 7 as receiver, though I suspect platform doesn't matter too much.

Getting the crash is all about timing: initiate the call, click accept at the receiver, then catch it after "Calling" disappears but just before/as the video streams come up on the display.

The crash rate I'm getting is probably somewhere between 1 in 15 and 1 in 30.
(In reply to Mark Banner (:standard8) from comment #11)
> Unfortunately, I'm now getting a slightly different stack:
> 
> 0   XUL                           	0x0000000109c4b4a8
> nr_ice_socket_readable_cb + 88 (ice_socket.c:58)
> 1   XUL                           	0x0000000109c81e0b
> mozilla::NrSocket::fire_callback(int) + 219 (nr_socket_prsock.cpp:191)
> 2   XUL                           	0x0000000109c81d03
> mozilla::NrSocket::OnSocketReady(PRFileDesc*, short) + 67
> (nr_socket_prsock.cpp:125)

I've just been able to catch this stack in a debugger.

We're crashing on this line:

58	    r_log(LOG_ICE,LOG_DEBUG,"ICE(%s): Socket ready to read",sock->ctx->label);

and here's the information about sock:

(gdb) print sock
$1 = (nr_ice_socket *) 0x11510316c
(gdb) print *sock
$2 = {
  sock = 0x5a5a5a5a5a5a5a5a, 
  ctx = 0x5a5a5a5a5a5a5a5a, 
  candidates = {
    tqh_first = 0x5a5a5a5a5a5a5a5a, 
    tqh_last = 0x5a5a5a5a5a5a5a5a
  }, 
  component = 0x5a5a5a5a5a5a5a5a, 
  stun_ctxs = {
    tqh_first = 0x5a5a5a5a5a5a5a5a, 
    tqh_last = 0x5a5a5a5a5a5a5a5a
  }, 
  stun_server = 0x5a5a5a5a5a5a5a5a, 
  stun_server_handle = 0x5a5a5a5a5a5a5a5a, 
  entry = {
    stqe_next = 0x5a5a5a5a5a5a5a5a
  }
}

The 0x5a fill pattern is jemalloc's junk fill for freed memory, so sock had already been deleted when the callback fired: a use-after-free.
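
For illustration, a minimal, self-contained sketch of a tripwire for that poison pattern. The constant and helper names are hypothetical, and jemalloc's 0x5a junk fill is a debug-allocator behavior rather than a guaranteed API:

/* Hypothetical use-after-free tripwire: with junk filling enabled,
 * jemalloc fills freed memory with 0x5a bytes, so a pointer field that
 * reads back as all-0x5a strongly suggests the containing struct was
 * freed before the callback ran. */
#include <stdint.h>
#include "mozilla/Assertions.h"

#define POISON_PTR ((uintptr_t)0x5a5a5a5a5a5a5a5aULL)

static void check_not_poisoned(const void *p)
  {
    if ((uintptr_t)p == POISON_PTR)
      MOZ_CRASH("pointer reads as jemalloc junk fill: use-after-free");
  }

Calling check_not_poisoned(sock->ctx) at the top of nr_ice_socket_readable_cb would turn this use-after-free into a deterministic, labeled crash.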
More information from the crash, specifically 'bt full' and 'info threads'. I have this held in a debugger and should be able to keep it there for the next 2-3 hours.
Mark: the attached patch tests that the destroy is called on the right thread.

Can you please try to repro this error with the patch applied?
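
As a sketch of what "tests that the destroy is called on the right thread" can look like, here is a hypothetical wrong-thread guard. The helper name is illustrative and this is not the attached patch, but the service lookup and IsOnCurrentThread check use standard Gecko APIs:

// Hypothetical guard (not the attached patch): assert that ICE teardown
// is happening on the socket transport service (STS) thread.
#include "nsCOMPtr.h"
#include "nsIEventTarget.h"
#include "nsNetCID.h"
#include "nsServiceManagerUtils.h"
#include "mozilla/Assertions.h"

static void AssertOnSTSThread() {
  nsCOMPtr<nsIEventTarget> sts =
      do_GetService(NS_SOCKETTRANSPORTSERVICE_CONTRACTID);
  bool onSTS = false;
  if (sts) {
    sts->IsOnCurrentThread(&onSTS);
  }
  MOZ_ASSERT(onSTS, "ICE object destroyed off the STS thread");
}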
Attachment #714710 - Flags: review?(adam)
Note that I wonder if this actually is the same crash... but it certainly is a crash provoked by the same actions.
That crash report was another bug that can be hit with the same sequence.

Marking this bug as non-blocking for preffing on the feature; it's a low-frequency crash requiring just the right timing. I'll continue to try to repro it with ekr's asserts added.
No longer blocks: 796463
Attachment #714710 - Flags: review?(adam) → review+
The logs above are for bug 842171
Landed the additional asserts:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8febb0aa205f
Whiteboard: [webrtc][blocking-webrtc+] → [webrtc][blocking-webrtc+][leave-open]
Not sure this helps; the stack appears royally trashed.
Attachment #714950 - Attachment is obsolete: true
Finally, a clean wrong-thread release of an ICE socket.

I don't see how NR_ASYNC_SCHEDULE(nr_ice_ctx_destroy_cb,*ctxp); (ice_ctx.c:376) can ever be correct unless it's *certain* it won't cause socket refcounts to go to 0 and get released.
This shows who called nr_ice_ctx_destroy() on a thread other than STS (I added a MOZ_ASSERT(!NS_IsMainThread()) to Nr_async_schedule() for just the one call that destroys the ICE context; a quick hack).

It shows that the source is the same bug we found and reported as a different problem over the weekend: VcmRxStartICE_m() failing and releasing the RTP flow it just created.

Looks like we need to either fix that use case in VcmRxStartICE_m() (and add a strong assert that flows are destroyed on the STS thread), or fix it so TransportFlows can deal with being released off STS (by proxying their contents over to STS).
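
A minimal sketch of the second option (proxying the release to STS), assuming a RefPtr-held flow and an STS event target; the class and function names are hypothetical, not the eventual fix:

// Hypothetical sketch: instead of letting the last TransportFlow
// reference drop on main thread, transfer it to a runnable that
// releases it on the STS thread.
#include "nsCOMPtr.h"
#include "nsThreadUtils.h"
#include "nsIEventTarget.h"
#include "mozilla/RefPtr.h"
#include "transportflow.h"  // in-tree TransportFlow definition

class ReleaseFlowEvent : public nsRunnable {
 public:
  explicit ReleaseFlowEvent(already_AddRefed<TransportFlow> aFlow)
    : mFlow(aFlow) {}

  NS_IMETHOD Run() {
    mFlow = nullptr;  // Last reference dropped here, on the STS thread.
    return NS_OK;
  }

 private:
  RefPtr<TransportFlow> mFlow;
};

// Caller (on main thread) hands over its reference with .forget():
void ReleaseFlowOnSTS(nsIEventTarget* aSTSThread,
                      RefPtr<TransportFlow>& aFlow) {
  nsCOMPtr<nsIRunnable> event = new ReleaseFlowEvent(aFlow.forget());
  aSTSThread->Dispatch(event, NS_DISPATCH_NORMAL);
}

The key point is the ownership hand-off: the caller forgets its reference before dispatch, so the only remaining reference dies on STS.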
This (the timer issue) is a different aspect of the same bug as bug 842171 (now dupped back to here). The discussion from that bug as to the cause:

When calling VcmRxStartICE_m(), vcmCreateTransportFlow() can error out due to mMedia becoming NULL, causing wrong-thread releases in VcmRxStartICE_m().

This is likely due to the patch for bug 820538, and the fact that we're still using DISPATCH_SYNC.

Since pc->mMedia became NULL, we can't store the flow in it, so we release it on main thread, which due to bug 830100 can trigger a thread-safety assertion. (Note that this crashes on purpose even in an opt/release build, which limits any security implications.)

Releasing the TransportFlow on MainThread can cause NSS to send on main thread (bug 842171) or a timer to be spawned on MainThread to destroy the ICE context (this bug).
(In reply to Randell Jesup [:jesup] from comment #22)
> Created attachment 715249 [details]
> clean log of wrong-thread socket release
> 
> Finally, a clean wrong-thread release of an ICE socket.
> 
> I don't see how  NR_ASYNC_SCHEDULE(nr_ice_ctx_destroy_cb,*ctxp);
> (ice_ctx.c:376) can ever be correct unless it's *certain* it won't cause
> socket refcounts to go to 0 and get released.

Huh? That's in fact the intended behavior. It just needs to happen on the right thread.
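
To make "on the right thread" concrete, a hypothetical guard; the thread-check helper is illustrative, and the signature follows nICEr's usual destroy convention rather than quoting the actual function body:

/* Hypothetical sketch, not a real patch: per the comments above, the
 * deferred destroy is intended behavior; it just has to be initiated on
 * the STS thread so the timer and the eventual nr_ice_ctx_destroy_cb
 * (with any socket refcount drops it triggers) stay on STS. */
int nr_ice_ctx_destroy(nr_ice_ctx **ctxp)
  {
    if (!ctxp || !*ctxp)
      return (0);

    assert_on_sts_thread();  /* hypothetical helper */

    NR_ASYNC_SCHEDULE(nr_ice_ctx_destroy_cb, *ctxp);
    *ctxp = 0;

    return (0);
  }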
Bug 830100 has a patch awaiting final review, so that will make this bug moot when it lands.
Depends on: 830100
On second thought, it's clear this will be fixed by bug 830100, so I'll just dup it now.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
Whiteboard: [webrtc][blocking-webrtc+][leave-open] → [webrtc][blocking-webrtc+]