Bug 835283 (closed): opened 11 years ago, closed 11 years ago

Occasional crash in [@ nr_ice_socket_readable_cb][@ nr_socket_recvfrom]

Categories: Core :: WebRTC, defect, P1
Platform: All / macOS
Status: RESOLVED DUPLICATE of bug 830100
People: Reporter: standard8; Assigned: ekr
Keywords: crash
Whiteboard: [webrtc][blocking-webrtc+]
Attachments: 6 files, 1 obsolete file

STR:

1) Have two computers hooked up to the social API demo
2) Initiate a call
3) Accept on the other end
4) As the call exchange is going through, close the call at the initiation end

Results:

Occasionally it crashes at the initiator with the stack reported below; I'd guess about 1 in 30 times on my local setup.

(Note: this is similar to bug 821241, but it crashes at the initiator and the stack is different.)

Example stacks:

bp-14a11b81-619c-4e93-a6ce-6c36a2130128
bp-5c47c52a-34e2-4da3-8c9e-0f0c72130128
bp-ac171cbb-a71c-45af-b01b-ee5362130128

0 	XUL 	nr_socket_recvfrom
1 	XUL 	nr_ice_socket_readable_cb
2 	XUL 	mozilla::NrSocket::OnSocketReady
3 	XUL 	nsSocketTransportService::DoPollIteration
4 	XUL 	nsSocketTransportService::Run
5 	XUL 	_ZThn24_N24nsSocketTransportService3RunEv
6 	XUL 	nsThread::ProcessNextEvent
7 	XUL 	NS_ProcessNextEvent_P
8 	XUL 	nsThread::ThreadFunc
9 	libnspr4.dylib 	_pt_root
10 	libsystem_c.dylib
11 	libsystem_c.dylib
12 	libnspr4.dylib
ekr - is this a known bug?
Assignee: nobody → ekr
Priority: -- → P1
Whiteboard: [WebRTC] → [webrtc][blocking-webrtc+]
Blocks: 796463
No, it's not a known bug and it's not obvious from looking at the code what the heck is going on.

Suggest we land a few MOZ_CRASH tests to try to work out which null pointer we are about to dereference.
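
For concreteness, here is a minimal sketch of what such MOZ_CRASH probes could look like at the top of the callback. This is illustrative only, not the landed patch; the messages and placement are assumptions, and the surrounding nICEr headers are assumed in-tree, though the callback follows the standard nICEr async-callback shape:

/* Hypothetical diagnostic probes (not the landed patch): crash with a
 * distinct message per pointer so the crash-stats signature tells us
 * exactly which dereference was about to fail. Assumes the nICEr headers
 * that define NR_SOCKET and nr_ice_socket are in scope. */
#include "mozilla/Assertions.h"

static void nr_ice_socket_readable_cb(NR_SOCKET s, int how, void *cb_arg)
  {
    nr_ice_socket *sock = (nr_ice_socket *)cb_arg;

    if (!sock)
      MOZ_CRASH("nr_ice_socket_readable_cb: null sock");
    if (!sock->ctx)
      MOZ_CRASH("nr_ice_socket_readable_cb: null sock->ctx");
    if (!sock->sock)
      MOZ_CRASH("nr_ice_socket_readable_cb: null sock->sock");

    /* ... existing read/dispatch logic would continue here ... */
  }

Each probe crashes with its own message, so one round of crash reports would distinguish a null sock from a null sock->ctx without needing a debugger.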
Can you try the test with the above patch and see what happens?
Flags: needinfo?(mbanner)
I can help here too, but can you kick off a try build with your patch? That will make investigating this go quicker.
Keywords: qawanted
Tested using a Win 7 machine and an OS X 10.7 machine, following the STR from comment 0 in a few variations. I didn't manage to get a crash.
Keywords: qawanted
(In reply to Jason Smith [:jsmith] from comment #8)
> Tested using a Win 7 machine and a OS X 10.7 machine, used the STR from
> comment 0 in a few variations. I didn't manage to get a crash.

Note: this testing used the try builds from comment 7, which include the attached patch.
Comparing against today's (2/14) nightly, I tried testing this as well; I couldn't reproduce the crash with the STR in comment 0.
I've just been reproducing this on current nightly, and on a debug build of nightly with the patch applied. It does seem harder to reproduce in a debug build.

Unfortunately, I'm now getting a slightly different stack:

0   XUL                           	0x0000000109c4b4a8 nr_ice_socket_readable_cb + 88 (ice_socket.c:58)
1   XUL                           	0x0000000109c81e0b mozilla::NrSocket::fire_callback(int) + 219 (nr_socket_prsock.cpp:191)
2   XUL                           	0x0000000109c81d03 mozilla::NrSocket::OnSocketReady(PRFileDesc*, short) + 67 (nr_socket_prsock.cpp:125)
3   XUL                           	0x0000000106083af6 nsSocketTransportService::DoPollIteration(bool) + 1158 (nsSocketTransportService2.cpp:786)
4   XUL                           	0x00000001060834c2 nsSocketTransportService::Run() + 274 (nsSocketTransportService2.cpp:645)
5   XUL                           	0x0000000106083efc non-virtual thunk to nsSocketTransportService::Run() + 28
6   XUL                           	0x0000000108fd34d3 nsThread::ProcessNextEvent(bool, bool*) + 1619 (nsThread.cpp:628)
7   XUL                           	0x0000000108f340ff NS_ProcessNextEvent_P(nsIThread*, bool) + 159 (nsThreadUtils.cpp:238)
8   XUL                           	0x0000000108fd1ef7 nsThread::ThreadFunc(void*) + 295 (nsThread.cpp:265)
9   libnspr4.dylib                	0x0000000105948663 _pt_root + 355 (ptthread.c:159)
10  libsystem_c.dylib             	0x00007fff8db6e742 _pthread_start + 327
11  libsystem_c.dylib             	0x00007fff8db5b181 thread_start + 13
Flags: needinfo?(mbanner)
(In reply to Mark Banner (:standard8) from comment #0)
> STR:
> 
> 1) Have two computers hooked up to the social API demo
> 2) Initiate a call
> 3) Accept on the other end
> 4) As the call exchange is going through, close the call at the initiation
> end

I should expand on these. I've been using a Mac OS X 10.8 computer as the initiator and Windows 7 as receiver, though I suspect platform doesn't matter too much.

Getting the crash is all about timing: initiate the call, click accept at the receiver, then catch it after "Calling" disappears but just before/as the video streams come up on the display.

The crash rate I'm getting is probably somewhere between 1 in 15 and 1 in 30.
(In reply to Mark Banner (:standard8) from comment #11)
> Unfortunately, I'm now getting a slightly different stack:
> 
> 0   XUL                           	0x0000000109c4b4a8
> nr_ice_socket_readable_cb + 88 (ice_socket.c:58)
> 1   XUL                           	0x0000000109c81e0b
> mozilla::NrSocket::fire_callback(int) + 219 (nr_socket_prsock.cpp:191)
> 2   XUL                           	0x0000000109c81d03
> mozilla::NrSocket::OnSocketReady(PRFileDesc*, short) + 67
> (nr_socket_prsock.cpp:125)

I've just been able to catch this stack in a debugger.

We're crashing on this line:

58	    r_log(LOG_ICE,LOG_DEBUG,"ICE(%s): Socket ready to read",sock->ctx->label);

and here's the information about sock:

(gdb) print sock
$1 = (nr_ice_socket *) 0x11510316c
(gdb) print *sock
$2 = {
  sock = 0x5a5a5a5a5a5a5a5a, 
  ctx = 0x5a5a5a5a5a5a5a5a, 
  candidates = {
    tqh_first = 0x5a5a5a5a5a5a5a5a, 
    tqh_last = 0x5a5a5a5a5a5a5a5a
  }, 
  component = 0x5a5a5a5a5a5a5a5a, 
  stun_ctxs = {
    tqh_first = 0x5a5a5a5a5a5a5a5a, 
    tqh_last = 0x5a5a5a5a5a5a5a5a
  }, 
  stun_server = 0x5a5a5a5a5a5a5a5a, 
  stun_server_handle = 0x5a5a5a5a5a5a5a5a, 
  entry = {
    stqe_next = 0x5a5a5a5a5a5a5a5a
  }
}

The 0x5a fill pattern is jemalloc's junk fill for freed memory, so sock had already been deleted when the callback fired: a use-after-free.
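
For illustration, a minimal, self-contained sketch of a tripwire for that poison pattern. The constant and helper names are hypothetical, and jemalloc's 0x5a junk fill is a debug-allocator behavior rather than a guaranteed API:

/* Hypothetical use-after-free tripwire: with junk filling enabled,
 * jemalloc fills freed memory with 0x5a bytes, so a pointer field that
 * reads back as all-0x5a strongly suggests the containing struct was
 * freed before the callback ran. */
#include <stdint.h>
#include "mozilla/Assertions.h"

#define POISON_PTR ((uintptr_t)0x5a5a5a5a5a5a5a5aULL)

static void check_not_poisoned(const void *p)
  {
    if ((uintptr_t)p == POISON_PTR)
      MOZ_CRASH("pointer reads as jemalloc junk fill: use-after-free");
  }

Calling check_not_poisoned(sock->ctx) at the top of nr_ice_socket_readable_cb would turn this use-after-free into a deterministic, labeled crash.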
More information from the crash, specifically 'bt full' and 'info threads'. I have this held in a debugger and should be able to keep it there for the next 2-3 hours.
Mark: the attached patch tests that the destroy is called on the right thread.

Can you please try to repro this error with the patch applied?
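
As a sketch of what "tests that the destroy is called on the right thread" can look like, here is a hypothetical wrong-thread guard. The helper name is illustrative and this is not the attached patch, but the service lookup and IsOnCurrentThread check use standard Gecko APIs:

// Hypothetical guard (not the attached patch): assert that ICE teardown
// is happening on the socket transport service (STS) thread.
#include "nsCOMPtr.h"
#include "nsIEventTarget.h"
#include "nsNetCID.h"
#include "nsServiceManagerUtils.h"
#include "mozilla/Assertions.h"

static void AssertOnSTSThread() {
  nsCOMPtr<nsIEventTarget> sts =
      do_GetService(NS_SOCKETTRANSPORTSERVICE_CONTRACTID);
  bool onSTS = false;
  if (sts) {
    sts->IsOnCurrentThread(&onSTS);
  }
  MOZ_ASSERT(onSTS, "ICE object destroyed off the STS thread");
}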
Attachment #714710 - Flags: review?(adam)
Note that I wonder if this actually is the same crash... but it certainly is a crash provoked by the same actions.
That crash report was another bug that can be hit with the same sequence.

Marking this bug as non-blocking for preffing on the feature; it's a low-frequency crash requiring just the right timing. I'll continue to try to repro it with ekr's asserts added.
No longer blocks: 796463
Attachment #714710 - Flags: review?(adam) → review+
The logs above are for bug 842171
Landed the additional asserts:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8febb0aa205f
Whiteboard: [webrtc][blocking-webrtc+] → [webrtc][blocking-webrtc+][leave-open]
Not sure this helps; the stack appears royally trashed.
Attachment #714950 - Attachment is obsolete: true
Finally, a clean wrong-thread release of an ICE socket.

I don't see how NR_ASYNC_SCHEDULE(nr_ice_ctx_destroy_cb,*ctxp); (ice_ctx.c:376) can ever be correct unless it's *certain* it won't cause socket refcounts to go to 0 and get released.
This shows who called nr_ice_ctx_destroy() on a thread other than STS (I added a MOZ_ASSERT(!NS_IsMainThread()) to Nr_async_schedule() for just the one call that destroys the ICE context; a quick hack).

It shows that the source is the same bug we found and reported as a different problem over the weekend: VcmRxStartICE_m() failing and releasing the RTP flow it just created.

Looks like we need to either fix that use case in VcmRxStartICE_m() (and add a strong assert that flows are destroyed on the STS thread), or fix it so TransportFlows can deal with being released off STS (by proxying their contents over to STS).
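
A minimal sketch of the second option (proxying the release to STS), assuming a RefPtr-held flow and an STS event target; the class and function names are hypothetical, not the eventual fix:

// Hypothetical sketch: instead of letting the last TransportFlow
// reference drop on main thread, transfer it to a runnable that
// releases it on the STS thread.
#include "nsCOMPtr.h"
#include "nsThreadUtils.h"
#include "nsIEventTarget.h"
#include "mozilla/RefPtr.h"
#include "transportflow.h"  // in-tree TransportFlow definition

class ReleaseFlowEvent : public nsRunnable {
 public:
  explicit ReleaseFlowEvent(already_AddRefed<TransportFlow> aFlow)
    : mFlow(aFlow) {}

  NS_IMETHOD Run() {
    mFlow = nullptr;  // Last reference dropped here, on the STS thread.
    return NS_OK;
  }

 private:
  RefPtr<TransportFlow> mFlow;
};

// Caller (on main thread) hands over its reference with .forget():
void ReleaseFlowOnSTS(nsIEventTarget* aSTSThread,
                      RefPtr<TransportFlow>& aFlow) {
  nsCOMPtr<nsIRunnable> event = new ReleaseFlowEvent(aFlow.forget());
  aSTSThread->Dispatch(event, NS_DISPATCH_NORMAL);
}

The key point is the ownership hand-off: the caller forgets its reference before dispatch, so the only remaining reference dies on STS.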
This (the timer issue) is a different aspect of the same bug as bug 842171 (now dupped back to here). The discussion from that bug as to the cause:

When calling VcmRxStartICE_m(), vcmCreateTransportFlow() can error out due to mMedia becoming NULL, causing wrong-thread releases in VcmRxStartICE_m().

This is likely due to the patch for bug 820538, and the fact that we're still using DISPATCH_SYNC.

Since pc->mMedia became NULL, we can't store the flow in it, so we release it on main thread, which due to bug 830100 can trigger a thread-safety assertion. (Note that this crashes on purpose even in an opt/release build, which limits any security implications.)

Releasing the TransportFlow on MainThread can cause NSS to send on main thread (bug 842171) or a timer to be spawned on MainThread to destroy the ICE context (this bug).
(In reply to Randell Jesup [:jesup] from comment #22)
> Created attachment 715249 [details]
> clean log of wrong-thread socket release
> 
> Finally, a clean wrong-thread release of an ICE socket.
> 
> I don't see how  NR_ASYNC_SCHEDULE(nr_ice_ctx_destroy_cb,*ctxp);
> (ice_ctx.c:376) can ever be correct unless it's *certain* it won't cause
> socket refcounts to go to 0 and get released.

Huh? That's in fact the intended behavior. It just needs to happen on the right thread.
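
To make "on the right thread" concrete, a hypothetical guard; the thread-check helper is illustrative, and the signature follows nICEr's usual destroy convention rather than quoting the actual function body:

/* Hypothetical sketch, not a real patch: per the comments above, the
 * deferred destroy is intended behavior; it just has to be initiated on
 * the STS thread so the timer and the eventual nr_ice_ctx_destroy_cb
 * (with any socket refcount drops it triggers) stay on STS. */
int nr_ice_ctx_destroy(nr_ice_ctx **ctxp)
  {
    if (!ctxp || !*ctxp)
      return (0);

    assert_on_sts_thread();  /* hypothetical helper */

    NR_ASYNC_SCHEDULE(nr_ice_ctx_destroy_cb, *ctxp);
    *ctxp = 0;

    return (0);
  }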
Bug 830100 has a patch awaiting final review, so that will make this bug moot when it lands.
Depends on: 830100
On second thought, it's clear this will be fixed by bug 830100, so I'll just dup it now.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE
Whiteboard: [webrtc][blocking-webrtc+][leave-open] → [webrtc][blocking-webrtc+]