Occasional crash in [@ nr_ice_socket_readable_cb][@ nr_socket_recvfrom]

RESOLVED DUPLICATE of bug 830100

Status

()

P1
critical
RESOLVED DUPLICATE of bug 830100
6 years ago
6 years ago

People

(Reporter: standard8, Assigned: ekr)

Tracking

({crash})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [webrtc][blocking-webrtc+], crash signature)

Attachments

(6 attachments, 1 obsolete attachment)

(Reporter)

Description

6 years ago
STR:

1) Have two computers hooked up to the social API demo
2) Initiate a call
3) Accept on the other end
4) As the call exchange is going through, close the call at the initiation end

Results:

Occasionally it crashes, at the initiator with the stack reported. I'm guessing about 1 out of 30 times on my local setup.

(note: this is similar to bug 821241, but it crashes at the initiator and the stack is different)

Example stacks

bp-14a11b81-619c-4e93-a6ce-6c36a2130128
bp-5c47c52a-34e2-4da3-8c9e-0f0c72130128
bp-ac171cbb-a71c-45af-b01b-ee5362130128

0 	XUL 	nr_socket_recvfrom
1 	XUL 	nr_ice_socket_readable_cb
2 	XUL 	mozilla::NrSocket::OnSocketReady
3 	XUL 	nsSocketTransportService::DoPollIteration
4 	XUL 	nsSocketTransportService::Run
5 	XUL 	_ZThn24_N24nsSocketTransportService3RunEv
6 	XUL 	nsThread::ProcessNextEvent
7 	XUL 	NS_ProcessNextEvent_P
8 	XUL 	nsThread::ThreadFunc
9 	libnspr4.dylib 	_pt_root
10 	libsystem_c.dylib
11 	libsystem_c.dylib
12 	libnspr4.dylib
ekr - is this a known bug?
Assignee: nobody → ekr
Priority: -- → P1
Whiteboard: [WebRTC] → [webrtc][blocking-webrtc+]
(Assignee)

Comment 2

6 years ago
No, it's not a known bug and it's not obvious from looking at the code what the heck is going on.

Suggest we land a few MOZ_CRASH tests to try to work out which null pointer we are about to dereference.
(Assignee)

Comment 3

6 years ago
Created attachment 713523 [details] [diff] [review]
New instrumentation for diagnosing
(Assignee)

Comment 4

6 years ago
can you try the test with the above patch and see what happens?
Flags: needinfo?(mbanner)
I can help here too, but can kick off a try build with your patch? That will make investigating this go quicker.
Keywords: qawanted
Tested using a Win 7 machine and a OS X 10.7 machine, used the STR from comment 0 in a few variations. I didn't manage to get a crash.
Keywords: qawanted
(In reply to Jason Smith [:jsmith] from comment #8)
> Tested using a Win 7 machine and a OS X 10.7 machine, used the STR from
> comment 0 in a few variations. I didn't manage to get a crash.

Note - testing of this happened using the try builds from comment 7 that includes the patch attached
In comparing my testing to today's nightly on 2/14, I tried testing this as well - I couldn't reproduce the crash with the STR in comment 0.
(Reporter)

Comment 11

6 years ago
I've just been reproducing this on current nightly, and on a debug build of nightly with the patch applied. It does seem harder to reproduce in a debug build.

Unforuntately, I'm now getting a slightly different stack:

0   XUL                           	0x0000000109c4b4a8 nr_ice_socket_readable_cb + 88 (ice_socket.c:58)
1   XUL                           	0x0000000109c81e0b mozilla::NrSocket::fire_callback(int) + 219 (nr_socket_prsock.cpp:191)
2   XUL                           	0x0000000109c81d03 mozilla::NrSocket::OnSocketReady(PRFileDesc*, short) + 67 (nr_socket_prsock.cpp:125)
3   XUL                           	0x0000000106083af6 nsSocketTransportService::DoPollIteration(bool) + 1158 (nsSocketTransportService2.cpp:786)
4   XUL                           	0x00000001060834c2 nsSocketTransportService::Run() + 274 (nsSocketTransportService2.cpp:645)
5   XUL                           	0x0000000106083efc non-virtual thunk to nsSocketTransportService::Run() + 28
6   XUL                           	0x0000000108fd34d3 nsThread::ProcessNextEvent(bool, bool*) + 1619 (nsThread.cpp:628)
7   XUL                           	0x0000000108f340ff NS_ProcessNextEvent_P(nsIThread*, bool) + 159 (nsThreadUtils.cpp:238)
8   XUL                           	0x0000000108fd1ef7 nsThread::ThreadFunc(void*) + 295 (nsThread.cpp:265)
9   libnspr4.dylib                	0x0000000105948663 _pt_root + 355 (ptthread.c:159)
10  libsystem_c.dylib             	0x00007fff8db6e742 _pthread_start + 327
11  libsystem_c.dylib             	0x00007fff8db5b181 thread_start + 13
Flags: needinfo?(mbanner)
(Reporter)

Comment 12

6 years ago
(In reply to Mark Banner (:standard8) from comment #0)
> STR:
> 
> 1) Have two computers hooked up to the social API demo
> 2) Initiate a call
> 3) Accept on the other end
> 4) As the call exchange is going through, close the call at the initiation
> end

I should expand on these. I've been using a Mac OS X 10.8 computer as the initiator and Windows 7 as receiver, though I suspect platform doesn't matter too much.

Getting the crash is all about timing - you need to initiate the call, click accept at the receiver, then here comes the timing. You need to catch it after "Calling" disappears, but just before/as the video streams come up on the display.

1 in 15 or 1 in 30 is probably roughly what I'm getting as the crash rate.
(Reporter)

Comment 13

6 years ago
(In reply to Mark Banner (:standard8) from comment #11)
> Unfortunately, I'm now getting a slightly different stack:
> 
> 0   XUL                           	0x0000000109c4b4a8
> nr_ice_socket_readable_cb + 88 (ice_socket.c:58)
> 1   XUL                           	0x0000000109c81e0b
> mozilla::NrSocket::fire_callback(int) + 219 (nr_socket_prsock.cpp:191)
> 2   XUL                           	0x0000000109c81d03
> mozilla::NrSocket::OnSocketReady(PRFileDesc*, short) + 67
> (nr_socket_prsock.cpp:125)

I've just been able to catch this stack in a debugger.

We're crashing on this line:

58	    r_log(LOG_ICE,LOG_DEBUG,"ICE(%s): Socket ready to read",sock->ctx->label);

and here's the information about sock:

(gdb) print sock
$1 = (nr_ice_socket *) 0x11510316c
(gdb) print *sock
$2 = {
  sock = 0x5a5a5a5a5a5a5a5a, 
  ctx = 0x5a5a5a5a5a5a5a5a, 
  candidates = {
    tqh_first = 0x5a5a5a5a5a5a5a5a, 
    tqh_last = 0x5a5a5a5a5a5a5a5a
  }, 
  component = 0x5a5a5a5a5a5a5a5a, 
  stun_ctxs = {
    tqh_first = 0x5a5a5a5a5a5a5a5a, 
    tqh_last = 0x5a5a5a5a5a5a5a5a
  }, 
  stun_server = 0x5a5a5a5a5a5a5a5a, 
  stun_server_handle = 0x5a5a5a5a5a5a5a5a, 
  entry = {
    stqe_next = 0x5a5a5a5a5a5a5a5a
  }
}

which iirc is deleted memory?
(Reporter)

Comment 14

6 years ago
Created attachment 714346 [details]
More information from crash

More information from the crash, specifically 'bt full' and 'info threads'. I have this held in a debugger and should be able to keep it there for the next 2-3 hours.
(Assignee)

Comment 15

6 years ago
Created attachment 714710 [details] [diff] [review]
Thread asserts for nr_socket_prsock
(Assignee)

Comment 16

6 years ago
Mark: the attached patch tests that the destroy is called on the right thread.

Can you please try to repro this error with the patch applied.
(Assignee)

Updated

6 years ago
Attachment #714710 - Flags: review?(adam)
Created attachment 714950 [details]
gdb logs of a crash from thread asserts

Note that I wonder if this actually is the same crash... but it certainly is a crash provoked by the same actions
That crash report was another bug that can be hit with the same sequence.

Marking this bug as non-blocking of preffing on the feature; low-frequency crash requiring just the right timing.  I'll continue to try to repro it with ekr's asserts added.
No longer blocks: 796463

Updated

6 years ago
Attachment #714710 - Flags: review?(adam) → review+
The logs above are for bug 842171
Landed the additional asserts
https://hg.mozilla.org/integration/mozilla-inbound/rev/8febb0aa205f
Whiteboard: [webrtc][blocking-webrtc+] → [webrtc][blocking-webrtc+][leave-open]
Created attachment 715233 [details]
wrong-thread assert with trashed stack

Not sure this helps; the stack appears royally trashed.
Attachment #714950 - Attachment is obsolete: true
Created attachment 715249 [details]
clean log of wrong-thread socket release

Finally, a clean wrong-thread release of an ICE socket.

I don't see how  NR_ASYNC_SCHEDULE(nr_ice_ctx_destroy_cb,*ctxp);
(ice_ctx.c:376) can ever be correct unless it's *certain* it won't cause socket refcounts to go to 0 and get released.
Created attachment 715351 [details]
Log of why we released the ice context on mainthread

This shows who called ns_ice_ctx_destroy() on a thread other than STS (I added a MOZ_ASSERT(!NS_IsMainThread) to Nr_async_schedule() for just the one call for destroying the ice context - a quick hack).

It shows that the source is the same bug we found and reported as a different problem over the weekend - Vcm_RxStartICE_m() failing and releasing the RTP flow it just created.

Looks like we need to either fix that use-case in Vcm_RxStartICE_m (and add a strong assert that flows are destroyed on STS thread), or fix it so TransportFlows can deal with being released off STS (by proxying their contents over to STS).

Updated

6 years ago
Duplicate of this bug: 842171
This (the timer issue) is a different aspect of the same bug as bug 842171 (now dupped back to here).  The discussion from that bug as to the cause:

When calling VcmRxStartICE_m(), vcmCreateTransportFlow() can error out due mMedia becoming NULL and cause wrong-thread releases in VcmRxStartICE_m().  

This is likely due to the patch for bug 820538, and that we're still using DISPATCH_SYNC.

Since pc->mMedia became NULL, we can't store the flow in it, and we release it on mainthread, which due to bug 830100 can cause a thread-safety assertion.  (Note that this crashes on purpose even in an opt/release build, which limits any security implications).


Releasing the TransportFlow on MainThread can cause NSS to send on mainthread (bug 842171) or a timer to be spawned on MainThread to destroy the ice context (this bug).
(Assignee)

Comment 27

6 years ago
(In reply to Randell Jesup [:jesup] from comment #22)
> Created attachment 715249 [details]
> clean log of wrong-thread socket release
> 
> Finally, a clean wrong-thread release of an ICE socket.
> 
> I don't see how  NR_ASYNC_SCHEDULE(nr_ice_ctx_destroy_cb,*ctxp);
> (ice_ctx.c:376) can ever be correct unless it's *certain* it won't cause
> socket refcounts to go to 0 and get released.

Huh? that's in fact the intended behavior. It just needs to happen
on the right thread.
Bug 830100 has a patch awaiting final review, so that will make this bug moot when it lands.
Depends on: 830100
On second thought, it's clear this will be fixed by bug 830100, so I'll just dup it now.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → DUPLICATE
Whiteboard: [webrtc][blocking-webrtc+][leave-open] → [webrtc][blocking-webrtc+]
Duplicate of bug: 830100
You need to log in before you can comment on or make changes to this bug.