Last Comment Bug 835283 - Occasional crash in [@ nr_ice_socket_readable_cb][@ nr_socket_recvfrom]
: Occasional crash in [@ nr_ice_socket_readable_cb][@ nr_socket_recvfrom]
Status: RESOLVED DUPLICATE of bug 830100
[webrtc][blocking-webrtc+]
: crash
Product: Core
Classification: Components
Component: WebRTC (show other bugs)
: unspecified
: All Mac OS X
P1 critical (vote)
: ---
Assigned To: Eric Rescorla (:ekr)
: Jason Smith [:jsmith]
: Nils Ohlmeier [:drno]
Mentors:
: 842171 (view as bug list)
Depends on: 830100
Blocks:
  Show dependency treegraph
 
Reported: 2013-01-28 02:42 PST by Mark Banner (:standard8)
Modified: 2013-03-22 20:15 PDT (History)
9 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
New instrumentation for diagnosing (1005 bytes, patch)
2013-02-13 11:12 PST, Eric Rescorla (:ekr)
no flags Details | Diff | Splinter Review
More information from crash (9.51 KB, text/plain)
2013-02-15 06:07 PST, Mark Banner (:standard8)
no flags Details
Thread asserts for nr_socket_prsock (6.94 KB, patch)
2013-02-15 18:30 PST, Eric Rescorla (:ekr)
jesup: review+
Details | Diff | Splinter Review
gdb logs of a crash from thread asserts (35.65 KB, text/plain)
2013-02-17 12:00 PST, Randell Jesup [:jesup]
no flags Details
wrong-thread assert with trashed stack (15.09 KB, text/plain)
2013-02-18 13:45 PST, Randell Jesup [:jesup]
no flags Details
clean log of wrong-thread socket release (22.34 KB, text/plain)
2013-02-18 14:43 PST, Randell Jesup [:jesup]
no flags Details
Log of why we released the ice context on mainthread (14.93 KB, text/plain)
2013-02-18 23:04 PST, Randell Jesup [:jesup]
no flags Details

Description User image Mark Banner (:standard8) 2013-01-28 02:42:56 PST
STR:

1) Have two computers hooked up to the social API demo
2) Initiate a call
3) Accept on the other end
4) As the call exchange is going through, close the call at the initiation end

Results:

Occasionally it crashes, at the initiator with the stack reported. I'm guessing about 1 out of 30 times on my local setup.

(note: this is similar to bug 821241, but it crashes at the initiator and the stack is different)

Example stacks

bp-14a11b81-619c-4e93-a6ce-6c36a2130128
bp-5c47c52a-34e2-4da3-8c9e-0f0c72130128
bp-ac171cbb-a71c-45af-b01b-ee5362130128

0 	XUL 	nr_socket_recvfrom
1 	XUL 	nr_ice_socket_readable_cb
2 	XUL 	mozilla::NrSocket::OnSocketReady
3 	XUL 	nsSocketTransportService::DoPollIteration
4 	XUL 	nsSocketTransportService::Run
5 	XUL 	_ZThn24_N24nsSocketTransportService3RunEv
6 	XUL 	nsThread::ProcessNextEvent
7 	XUL 	NS_ProcessNextEvent_P
8 	XUL 	nsThread::ThreadFunc
9 	libnspr4.dylib 	_pt_root
10 	libsystem_c.dylib
11 	libsystem_c.dylib
12 	libnspr4.dylib
Comment 1 User image Randell Jesup [:jesup] 2013-01-28 03:46:32 PST
ekr - is this a known bug?
Comment 2 User image Eric Rescorla (:ekr) 2013-02-10 22:34:05 PST
No, it's not a known bug and it's not obvious from looking at the code what the heck is going on.

Suggest we land a few MOZ_CRASH tests to try to work out which null pointer we are about to dereference.
Comment 3 User image Eric Rescorla (:ekr) 2013-02-13 11:12:13 PST
Created attachment 713523 [details] [diff] [review]
New instrumentation for diagnosing
Comment 4 User image Eric Rescorla (:ekr) 2013-02-13 11:12:47 PST
can you try the test with the above patch and see what happens?
Comment 5 User image Jason Smith [:jsmith] 2013-02-14 10:47:45 PST
I can help here too, but can kick off a try build with your patch? That will make investigating this go quicker.
Comment 6 User image Jason Smith [:jsmith] 2013-02-14 11:35:54 PST
https://tbpl.mozilla.org/?tree=Try&rev=911126988d90
Comment 7 User image Jason Smith [:jsmith] 2013-02-14 15:09:17 PST
Try builds are up - http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/rjesup@wgate.com-911126988d90/

Will investigate shortly...
Comment 8 User image Jason Smith [:jsmith] 2013-02-14 15:53:04 PST
Tested using a Win 7 machine and a OS X 10.7 machine, used the STR from comment 0 in a few variations. I didn't manage to get a crash.
Comment 9 User image Jason Smith [:jsmith] 2013-02-14 15:53:40 PST
(In reply to Jason Smith [:jsmith] from comment #8)
> Tested using a Win 7 machine and a OS X 10.7 machine, used the STR from
> comment 0 in a few variations. I didn't manage to get a crash.

Note - testing of this happened using the try builds from comment 7 that includes the patch attached
Comment 10 User image Jason Smith [:jsmith] 2013-02-14 15:59:52 PST
In comparing my testing to today's nightly on 2/14, I tried testing this as well - I couldn't reproduce the crash with the STR in comment 0.
Comment 11 User image Mark Banner (:standard8) 2013-02-15 02:48:04 PST
I've just been reproducing this on current nightly, and on a debug build of nightly with the patch applied. It does seem harder to reproduce in a debug build.

Unforuntately, I'm now getting a slightly different stack:

0   XUL                           	0x0000000109c4b4a8 nr_ice_socket_readable_cb + 88 (ice_socket.c:58)
1   XUL                           	0x0000000109c81e0b mozilla::NrSocket::fire_callback(int) + 219 (nr_socket_prsock.cpp:191)
2   XUL                           	0x0000000109c81d03 mozilla::NrSocket::OnSocketReady(PRFileDesc*, short) + 67 (nr_socket_prsock.cpp:125)
3   XUL                           	0x0000000106083af6 nsSocketTransportService::DoPollIteration(bool) + 1158 (nsSocketTransportService2.cpp:786)
4   XUL                           	0x00000001060834c2 nsSocketTransportService::Run() + 274 (nsSocketTransportService2.cpp:645)
5   XUL                           	0x0000000106083efc non-virtual thunk to nsSocketTransportService::Run() + 28
6   XUL                           	0x0000000108fd34d3 nsThread::ProcessNextEvent(bool, bool*) + 1619 (nsThread.cpp:628)
7   XUL                           	0x0000000108f340ff NS_ProcessNextEvent_P(nsIThread*, bool) + 159 (nsThreadUtils.cpp:238)
8   XUL                           	0x0000000108fd1ef7 nsThread::ThreadFunc(void*) + 295 (nsThread.cpp:265)
9   libnspr4.dylib                	0x0000000105948663 _pt_root + 355 (ptthread.c:159)
10  libsystem_c.dylib             	0x00007fff8db6e742 _pthread_start + 327
11  libsystem_c.dylib             	0x00007fff8db5b181 thread_start + 13
Comment 12 User image Mark Banner (:standard8) 2013-02-15 02:50:29 PST
(In reply to Mark Banner (:standard8) from comment #0)
> STR:
> 
> 1) Have two computers hooked up to the social API demo
> 2) Initiate a call
> 3) Accept on the other end
> 4) As the call exchange is going through, close the call at the initiation
> end

I should expand on these. I've been using a Mac OS X 10.8 computer as the initiator and Windows 7 as receiver, though I suspect platform doesn't matter too much.

Getting the crash is all about timing - you need to initiate the call, click accept at the receiver, then here comes the timing. You need to catch it after "Calling" disappears, but just before/as the video streams come up on the display.

1 in 15 or 1 in 30 is probably roughly what I'm getting as the crash rate.
Comment 13 User image Mark Banner (:standard8) 2013-02-15 03:13:42 PST
(In reply to Mark Banner (:standard8) from comment #11)
> Unfortunately, I'm now getting a slightly different stack:
> 
> 0   XUL                           	0x0000000109c4b4a8
> nr_ice_socket_readable_cb + 88 (ice_socket.c:58)
> 1   XUL                           	0x0000000109c81e0b
> mozilla::NrSocket::fire_callback(int) + 219 (nr_socket_prsock.cpp:191)
> 2   XUL                           	0x0000000109c81d03
> mozilla::NrSocket::OnSocketReady(PRFileDesc*, short) + 67
> (nr_socket_prsock.cpp:125)

I've just been able to catch this stack in a debugger.

We're crashing on this line:

58	    r_log(LOG_ICE,LOG_DEBUG,"ICE(%s): Socket ready to read",sock->ctx->label);

and here's the information about sock:

(gdb) print sock
$1 = (nr_ice_socket *) 0x11510316c
(gdb) print *sock
$2 = {
  sock = 0x5a5a5a5a5a5a5a5a, 
  ctx = 0x5a5a5a5a5a5a5a5a, 
  candidates = {
    tqh_first = 0x5a5a5a5a5a5a5a5a, 
    tqh_last = 0x5a5a5a5a5a5a5a5a
  }, 
  component = 0x5a5a5a5a5a5a5a5a, 
  stun_ctxs = {
    tqh_first = 0x5a5a5a5a5a5a5a5a, 
    tqh_last = 0x5a5a5a5a5a5a5a5a
  }, 
  stun_server = 0x5a5a5a5a5a5a5a5a, 
  stun_server_handle = 0x5a5a5a5a5a5a5a5a, 
  entry = {
    stqe_next = 0x5a5a5a5a5a5a5a5a
  }
}

which iirc is deleted memory?
Comment 14 User image Mark Banner (:standard8) 2013-02-15 06:07:57 PST
Created attachment 714346 [details]
More information from crash

More information from the crash, specifically 'bt full' and 'info threads'. I have this held in a debugger and should be able to keep it there for the next 2-3 hours.
Comment 15 User image Eric Rescorla (:ekr) 2013-02-15 18:30:46 PST
Created attachment 714710 [details] [diff] [review]
Thread asserts for nr_socket_prsock
Comment 16 User image Eric Rescorla (:ekr) 2013-02-15 18:33:05 PST
Mark: the attached patch tests that the destroy is called on the right thread.

Can you please try to repro this error with the patch applied.
Comment 17 User image Randell Jesup [:jesup] 2013-02-17 12:00:28 PST
Created attachment 714950 [details]
gdb logs of a crash from thread asserts

Note that I wonder if this actually is the same crash... but it certainly is a crash provoked by the same actions
Comment 18 User image Randell Jesup [:jesup] 2013-02-18 07:55:19 PST
That crash report was another bug that can be hit with the same sequence.

Marking this bug as non-blocking of preffing on the feature; low-frequency crash requiring just the right timing.  I'll continue to try to repro it with ekr's asserts added.
Comment 19 User image Randell Jesup [:jesup] 2013-02-18 07:58:47 PST
The logs above are for bug 842171
Comment 20 User image Randell Jesup [:jesup] 2013-02-18 13:29:34 PST
Landed the additional asserts
https://hg.mozilla.org/integration/mozilla-inbound/rev/8febb0aa205f
Comment 21 User image Randell Jesup [:jesup] 2013-02-18 13:45:00 PST
Created attachment 715233 [details]
wrong-thread assert with trashed stack

Not sure this helps; the stack appears royally trashed.
Comment 22 User image Randell Jesup [:jesup] 2013-02-18 14:43:00 PST
Created attachment 715249 [details]
clean log of wrong-thread socket release

Finally, a clean wrong-thread release of an ICE socket.

I don't see how  NR_ASYNC_SCHEDULE(nr_ice_ctx_destroy_cb,*ctxp);
(ice_ctx.c:376) can ever be correct unless it's *certain* it won't cause socket refcounts to go to 0 and get released.
Comment 23 User image Gregory Szorc [:gps] 2013-02-18 21:26:14 PST
https://hg.mozilla.org/mozilla-central/rev/8febb0aa205f
Comment 24 User image Randell Jesup [:jesup] 2013-02-18 23:04:12 PST
Created attachment 715351 [details]
Log of why we released the ice context on mainthread

This shows who called ns_ice_ctx_destroy() on a thread other than STS (I added a MOZ_ASSERT(!NS_IsMainThread) to Nr_async_schedule() for just the one call for destroying the ice context - a quick hack).

It shows that the source is the same bug we found and reported as a different problem over the weekend - Vcm_RxStartICE_m() failing and releasing the RTP flow it just created.

Looks like we need to either fix that use-case in Vcm_RxStartICE_m (and add a strong assert that flows are destroyed on STS thread), or fix it so TransportFlows can deal with being released off STS (by proxying their contents over to STS).
Comment 25 User image Randell Jesup [:jesup] 2013-02-18 23:17:36 PST
*** Bug 842171 has been marked as a duplicate of this bug. ***
Comment 26 User image Randell Jesup [:jesup] 2013-02-18 23:18:32 PST
This (the timer issue) is a different aspect of the same bug as bug 842171 (now dupped back to here).  The discussion from that bug as to the cause:

When calling VcmRxStartICE_m(), vcmCreateTransportFlow() can error out due mMedia becoming NULL and cause wrong-thread releases in VcmRxStartICE_m().  

This is likely due to the patch for bug 820538, and that we're still using DISPATCH_SYNC.

Since pc->mMedia became NULL, we can't store the flow in it, and we release it on mainthread, which due to bug 830100 can cause a thread-safety assertion.  (Note that this crashes on purpose even in an opt/release build, which limits any security implications).


Releasing the TransportFlow on MainThread can cause NSS to send on mainthread (bug 842171) or a timer to be spawned on MainThread to destroy the ice context (this bug).
Comment 27 User image Eric Rescorla (:ekr) 2013-02-19 02:05:04 PST
(In reply to Randell Jesup [:jesup] from comment #22)
> Created attachment 715249 [details]
> clean log of wrong-thread socket release
> 
> Finally, a clean wrong-thread release of an ICE socket.
> 
> I don't see how  NR_ASYNC_SCHEDULE(nr_ice_ctx_destroy_cb,*ctxp);
> (ice_ctx.c:376) can ever be correct unless it's *certain* it won't cause
> socket refcounts to go to 0 and get released.

Huh? that's in fact the intended behavior. It just needs to happen
on the right thread.
Comment 28 User image Randell Jesup [:jesup] 2013-03-22 20:13:11 PDT
Bug 830100 has a patch awaiting final review, so that will make this bug moot when it lands.
Comment 29 User image Randell Jesup [:jesup] 2013-03-22 20:15:56 PDT
On second thought, it's clear this will be fixed by bug 830100, so I'll just dup it now.

*** This bug has been marked as a duplicate of bug 830100 ***

Note You need to log in before you can comment on or make changes to this bug.