Closed Bug 863224 Opened 9 years ago Closed 9 years ago

ASSERTION and crash: You can't dereference a NULL nsRefPtr with operator->().: 'mRawPtr != 0', file ../../dist/include/nsAutoPtr.h, line 1028 [ @mozilla::DOMMediaStream::StreamListener::TrackChange::Run() ]

Categories

(Core :: Audio/Video, defect)

23 Branch
x86_64
Linux
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla23
Tracking Status
firefox22 --- wontfix
firefox23 + verified

People

(Reporter: whimboo, Assigned: roc)

References

Details

(Keywords: assertion, crash, Whiteboard: [WebRTC][blocking-webrtc+][qa-])

Crash Data

Attachments

(2 files, 2 obsolete files)

I see a constant assertion and crash on shutdown with one of my testcases in a recent m-c debug build built off 129074:b553e9ca2354. I'm working on getting a minimized testcase.

0:28.00 ###!!! ASSERTION: You can't dereference a NULL nsRefPtr with operator->().: 'mRawPtr != 0', file ../../dist/include/nsAutoPtr.h, line 1028
 0:35.18 mozilla::DOMMediaStream::StreamListener::TrackChange::Run() [content/media/MediaStreamTrack.h:51]
 0:35.18 nsThread::ProcessNextEvent(bool, bool*) [xpcom/threads/nsThread.cpp:627]
 0:35.18 NS_ProcessNextEvent(nsIThread*, bool) [/mozilla/code/firefox/obj/debug/xpcom/build/nsThreadUtils.cpp:238]
 0:35.18 mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) [ipc/glue/MessagePump.cpp:83]
 0:35.18 MessageLoop::RunInternal() [ipc/chromium/src/base/message_loop.cc:217]
 0:35.18 MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:503]
 0:35.18 nsBaseAppShell::Run() [widget/xpwidgets/nsBaseAppShell.cpp:165]
 0:35.18 nsAppStartup::Run() [toolkit/components/startup/nsAppStartup.cpp:289]
 0:35.18 XREMain::XRE_mainRun() [toolkit/xre/nsAppRunner.cpp:3882]
 0:35.18 XREMain::XRE_main(int, char**, nsXREAppData const*) [toolkit/xre/nsAppRunner.cpp:3948]
 0:35.18 XRE_main [toolkit/xre/nsAppRunner.cpp:4154]
 0:35.19 do_main [browser/app/nsBrowserApp.cpp:271]
 0:35.19 main [browser/app/nsBrowserApp.cpp:578]
 0:35.19 libc.so.6 + 0x2176d
 0:35.19 main [browser/app/nsBrowserApp.cpp:590]
 0:35.19 TEST-UNEXPECTED-FAIL| automation.py | Exited with code 11 during test run
 0:35.19 INFO | automation.py | Application ran for: 0:00:33.880652
 0:35.19 INFO | zombiecheck | Reading PID log: /tmp/tmpRpd0JDpidlog
PROCESS-CRASH | automation.py | application crashed [@ mozilla::DOMMediaStream::StreamListener::TrackChange::Run()]
Crash dump filename: /tmp/tmpVPpLNt/minidumps/4e356b16-db6b-216a-3379c9ff-39afc840.dmp
Operating system: Linux
                  0.0.0 Linux 3.5.0-27-generic #46-Ubuntu SMP Mon Mar 25 19:58:17 UTC 2013 x86_64
CPU: amd64
     family 6 model 58 stepping 9
     4 CPUs

Crash reason:  SIGSEGV
Crash address: 0x74

Thread 0 (crashed)
 0  libxul.so!mozilla::DOMMediaStream::StreamListener::TrackChange::Run() [MediaStreamTrack.h:6ca65ae1dfd2 : 51 + 0x0]
    rbx = 0x00007f733c053320   r12 = 0x0000000000000000
    r13 = 0x0000000000000001   r14 = 0x00007fff6a143730
    r15 = 0x00007fff6a14379f   rip = 0x00007f73dd6e774b
    rsp = 0x00007fff6a1436e0   rbp = 0x00007fff6a143700
    Found by: given as instruction pointer in context
 1  libxul.so!nsThread::ProcessNextEvent(bool, bool*) [nsThread.cpp:6ca65ae1dfd2 : 627 + 0x11]
    rip = 0x00007f73de2ecbbb   rsp = 0x00007fff6a143710
    rbp = 0x00007fff6a143780
    Found by: stack scanning
 2  libxul.so!TimerThread::RemoveTimer(nsTimerImpl*) [TimerThread.cpp:6ca65ae1dfd2 : 354 + 0x8]
    rip = 0x00007f73de2f2142   rsp = 0x00007fff6a143750
    rbp = 0x00007fff6a143780
    Found by: stack scanning
 3  libxul.so!NS_ProcessNextEvent(nsIThread*, bool) [nsThreadUtils.cpp : 238 + 0xc]
    rip = 0x00007f73de2a979e   rsp = 0x00007fff6a143790
    rbp = 0x00007fff6a1437b0
    Found by: stack scanning
 4  libxul.so!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) [MessagePump.cpp:6ca65ae1dfd2 : 82 + 0xa]
Component: WebRTC → Video/Audio
QA Contact: jsmith
Attached patch test for assertion (obsolete) — Splinter Review
Steps to reproduce:

1. Apply the patch
2. Run ./mach mochitest-plain dom/media/tests/mochitest/test_peerConnection_basicAudio.html
3. Wait until the test is done
4. Wait ten or more seconds
5. Close Firefox with Alt+F4

During shutdown you can see this assertion and crash.
Looks like a media stream track bug.

Roc - Any ideas?
Flags: needinfo?(roc)
Crash Signature: [ @mozilla::DOMMediaStream::StreamListener::TrackChange::Run() ] → [@ mozilla::DOMMediaStream::StreamListener::TrackChange::Run() ]
Probably a bug in the new code that landed for Bug 834835
Whiteboard: [WebRTC][blocking-webrtc?] → [WebRTC][blocking-webrtc+]
Assignee: nobody → roc
0:28.00 ###!!! ASSERTION: You can't dereference a NULL nsRefPtr with operator->().: 'mRawPtr != 0', file ../../dist/include/nsAutoPtr.h, line 1028
 0:35.18 mozilla::DOMMediaStream::StreamListener::TrackChange::Run() [content/media/MediaStreamTrack.h:51]

This is actually DOMMediaStream.cpp:66 I think; 'track' must be null.
This check should not be needed. I need to debug this.
Flags: needinfo?(roc)
I'm not sure if this bug is valid. If so, it would be a bug in the patches on Bug 834835 and will want to land with those patches.  

Rob -- based on your investigation, is this bug valid?
Flags: needinfo?(roc)
I haven't done any investigation yet. I will before I reland 834835.
Flags: needinfo?(roc)
I reapplied my patches in bug 834835 to mozilla-central and tried to reproduce this bug, but couldn't.
If try results look good I'll reland 834835 with this bug open. If someone can still reproduce it then, I will help them debug it.
Earlier today Rob (roc) re-landed bug 834835 on inbound about 9 hours ago.  Can you retest with inbound and see if you still see this bug?  If yes, can you give us a test case (even if it's the same one as in comment 1)?  If not, can you close this bug?  Thanks.
Blocks: 834835
Flags: needinfo?(hskupin)
I just did a pretty thorough test run with media stream tracks on a 4/27 build and could not reproduce the assertion in this bug either.

Closing for now unless we get a clear STR and reproduction.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(hskupin)
Resolution: --- → WORKSFORME
(In reply to Jason Smith [:jsmith] from comment #13)
> Closing for now unless we get a clear STR and reproduction.

Not quite sure which missing information you are referring here but this bug has all the steps necessary to do in comment 1 and comment 2.

This assertion is still happening for me with m-c changeset 130165:05533d50f2f7 from late yesterday.

Here updated information from gdb:

Program received signal SIGSEGV, Segmentation fault.
mozilla::DOMMediaStream::StreamListener::TrackChange::Run (this=0x7fff4804fb10)
    at /mozilla/code/firefox/nightly/content/media/DOMMediaStream.cpp:66
66	        track->NotifyEnded();
(gdb) bt
#0  mozilla::DOMMediaStream::StreamListener::TrackChange::Run (this=0x7fff4804fb10)
    at /mozilla/code/firefox/nightly/content/media/DOMMediaStream.cpp:66
#1  0x00007ffff2eb21fb in nsThread::ProcessNextEvent (this=0x710f90, mayWait=false, result=0x7fffffffc3ff)
    at /mozilla/code/firefox/nightly/xpcom/threads/nsThread.cpp:627
#2  0x00007ffff2e7063e in NS_ProcessNextEvent (thread=<optimized out>, mayWait=<optimized out>)
    at /mozilla/code/firefox/obj/debug/xpcom/build/nsThreadUtils.cpp:238
#3  0x00007ffff2b884ad in mozilla::ipc::MessagePump::Run (this=0x6cbfb0, aDelegate=0x70e200)
    at /mozilla/code/firefox/nightly/ipc/glue/MessagePump.cpp:82
#4  0x00007ffff2ee668c in MessageLoop::RunInternal (this=this@entry=0x70e200)
    at /mozilla/code/firefox/nightly/ipc/chromium/src/base/message_loop.cc:219
#5  0x00007ffff2ee66b4 in RunHandler (this=0x70e200)
    at /mozilla/code/firefox/nightly/ipc/chromium/src/base/message_loop.cc:212
#6  MessageLoop::Run (this=0x70e200) at /mozilla/code/firefox/nightly/ipc/chromium/src/base/message_loop.cc:186
#7  0x00007ffff2a92b79 in nsBaseAppShell::Run (this=0xa45f80)
    at /mozilla/code/firefox/nightly/widget/xpwidgets/nsBaseAppShell.cpp:163
#8  0x00007ffff28e4e2f in nsAppStartup::Run (this=0x6cf5d0)
    at /mozilla/code/firefox/nightly/toolkit/components/startup/nsAppStartup.cpp:289
#9  0x00007ffff1c89ee4 in XREMain::XRE_mainRun (this=this@entry=0x7fffffffc6d0)
    at /mozilla/code/firefox/nightly/toolkit/xre/nsAppRunner.cpp:3879
#10 0x00007ffff1c8a1b7 in XREMain::XRE_main (this=this@entry=0x7fffffffc6d0, argc=argc@entry=5, 
    argv=argv@entry=0x7fffffffdbc8, aAppData=aAppData@entry=0x7fffffffc8c0)
    at /mozilla/code/firefox/nightly/toolkit/xre/nsAppRunner.cpp:3946
#11 0x00007ffff1c8a425 in XRE_main (argc=5, argv=0x7fffffffdbc8, aAppData=0x7fffffffc8c0, aFlags=<optimized out>)
    at /mozilla/code/firefox/nightly/toolkit/xre/nsAppRunner.cpp:4147
#12 0x00000000004028be in do_main (argc=argc@entry=5, argv=argv@entry=0x7fffffffdbc8, xreDirectory=0x6150f0)
    at /mozilla/code/firefox/nightly/browser/app/nsBrowserApp.cpp:271
#13 0x0000000000401e3a in main (argc=5, argv=0x7fffffffdbc8)
    at /mozilla/code/firefox/nightly/browser/app/nsBrowserApp.cpp:576
(gdb) l
61	        track = stream->CreateDOMTrack(mID, mType);
62	      } else {
63	        track = stream->GetDOMTrackFor(mID);
64	      }
65	      if (mEvents & MediaStreamListener::TRACK_EVENT_ENDED) {
66	        track->NotifyEnded();
67	      }
68	      return NS_OK;
69	    }
(gdb) p track
$1 = {mRawPtr = 0x0}

This means we are trying to access a null track, which might already have been released.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Okay, I got the reproduction now with your test case. For me, it does require some timing of when you close firefox after completing the mochitest.
I'm going to put this back into question because this is only reproducible in a well-timed mochitest in a debug build.

An analysis of the severity of the crash will also help determine if we need to block or not.
Whiteboard: [WebRTC][blocking-webrtc+] → [WebRTC][blocking-webrtc?]
Not sure, but might it be related to when we run the garbage collector? I have to wait more than 10s after the test has been finished to see this assertion. Not sure how often we run GC.
Status: REOPENED → NEW
A little more often than that, but not a lot more often.
Given the trigger is "GC ran after the test", that sounds like something quite possible to hit in real-world code, so blocking+.

Roc, can you look into this one ASAP?  Thanks!
Flags: needinfo?(roc)
Whiteboard: [WebRTC][blocking-webrtc?] → [WebRTC][blocking-webrtc+][webrtc_uplift]
Roc, if you can't still reproduce please let me know how I can help. Not sure how to debug further on my own given the lack of experience in that area.
I reproduced this once, but I haven't been able to reproduce it in the debugger yet :-(
Flags: needinfo?(roc)
OK I can reproduce it enough now.

It's a cycle collector problem of some kind. We run Unlink on some DOMMediaStreams but the destructor for the unlinked stream doesn't run.

Andrew, what have I done wrong? This class is nsWrapperCached.
http://dxr.mozilla.org/mozilla-central/content/media/DOMMediaStream.h#l36
http://dxr.mozilla.org/mozilla-central/content/media/DOMMediaStream.cpp#l15
DOMMediaStream looks okay to me.  The problem is going to lie in whatever holds a strong reference to the DOMMediaStream (well, or indirectly, but let's hope it isn't that).  Because it is being unlinked, but not destroyed, the simplest possibility is that something is traversing this pointer, but not unlinking it.

Maybe the best approach is to get a CC dump (set the env vars XPCOM_CC_LOG_ALL=1 and  MOZ_CC_LOG_DIRECTORY to some directory) and then figure out what is owning the object in question?  Well, I'd try looking at what classes strongly own these streams first, and make sure their traverse and unlink functions match up, but failing that, staring at some logs might be helpful.
Looks like PeerConnectionImpl's LocalSourceStreamInfo / RemoteSourceStreamInfo hold strong references to DOMMediaStreams but don't participate in cycle collection at all. I suppose that's not related to the problem here.
I can't find the addresses of my Unlinked DOMMediaStreams anywhere in the cycle collector log files.
Scrub that, I've been looking at the wrong files.
The CC graph shows only two references to the DOMMediaStream. One is from the JS wrapper object, the other is from the MediaStreamTrack that it owns (so, part of a cycle). MediaStreamTrack has
NS_IMPL_CYCLE_COLLECTION_INHERITED_1(MediaStreamTrack, nsDOMEventTargetHelper,
                                     mStream)
Also, a printf in the destructor shows that mStream is null when the MediaStreamTrack is destroyed. So we must have unlinked it.
It looks like if a second cycle collection occurs, then we do destroy the DOMMediaStream.
When we crash, the refcount of the DOMMediaStream is 1. I think that means that the JS wrapper is still holding onto the DOMMediaStream host object. I guess the DOMMediaStream gets cleaned up later when the JS wrapper is collected.
Attached patch fixSplinter Review
I'm assuming it's expected behavior that we can return to the main thread event loop after DOMMediaStream is unlinked and and before its JS wrapper is collected and the DOMMediaStream destroyed.

In any case, this fix fixes the bug and has no downside.
Attachment #739116 - Attachment is obsolete: true
Attachment #739116 - Flags: review?(roc)
Attachment #744081 - Flags: review?(rjesup)
Attachment #744081 - Flags: feedback?(continuation)
Comment on attachment 744081 [details] [diff] [review]
fix

Review of attachment 744081 [details] [diff] [review]:
-----------------------------------------------------------------

Modulo I don't fully grok the CC #defines, r+
Attachment #744081 - Flags: review?(rjesup) → review+
Comment on attachment 744081 [details] [diff] [review]
fix

Review of attachment 744081 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/media/DOMMediaStream.cpp
@@ +23,5 @@
>  
>  NS_IMPL_CYCLE_COLLECTING_ADDREF(DOMMediaStream)
>  NS_IMPL_CYCLE_COLLECTING_RELEASE(DOMMediaStream)
> +NS_IMPL_CYCLE_COLLECTION_UNLINK_BEGIN(DOMMediaStream)
> +  tmp->Destroy();

Yeah, the behavior you are seeing is normal.  Only the GC can destroy wrappers, so wrappercached things will stay around until the next GC.  In at least some cases, that's because the wrapper needs information about its native when it does, but I don't recall precisely what.

I don't really understand what is the matter, though.  Somehow with the sequence CC->CC you properly clear out something, whereas with CC->GC you don't?

::: content/media/MediaStreamTrack.h
@@ +30,5 @@
>     */
>    MediaStreamTrack(DOMMediaStream* aStream, TrackID aTrackID);
>    virtual ~MediaStreamTrack();
>  
> +  NS_DECL_ISUPPORTS_INHERITED

I wish we had some way to turn this error into a compile failure.  It is a really easy mistake to make, particularly when you are changing a class around.
Attachment #744081 - Flags: feedback?(continuation) → feedback+
(In reply to Andrew McCreight [:mccr8] from comment #32)
> I don't really understand what is the matter, though.  Somehow with the
> sequence CC->CC you properly clear out something, whereas with CC->GC you
> don't?

Basically the state "Unlinked but not C++-destroyed" is a bad state and running arbitrary code against the object in that state is bad. The event that runs the arbitrary code only has a weak reference to the DOMMediaStream, so doesn't stop it from entering the bad state. The patch here disconnects that reference during Unlink so we're all good.
Thanks for the explanation!  This pattern of "define a destroy method, call it in both unlink and the destructor" pops up not infrequently, and is rather unfortunate.
https://hg.mozilla.org/mozilla-central/rev/243a19f75873
Status: NEW → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
Whiteboard: [WebRTC][blocking-webrtc+][webrtc_uplift] → [WebRTC][blocking-webrtc+][webrtc_uplift][qa-]
Whiteboard: [WebRTC][blocking-webrtc+][webrtc_uplift][qa-] → [WebRTC][blocking-webrtc+][webrtc-uplift][qa-]
Verified fixed with a debug build of changeset 131239:d0043772259e. Thanks Roc!
Status: RESOLVED → VERIFIED
Comment on attachment 744081 [details] [diff] [review]
fix

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 834835 (already uplifted)

User impact if declined: Crashes if GC runs after certain operations; or requiring backout of an important compatibility feature (bug 834835)

Testing completed (on m-c, etc.): On MC.  One regression (bug 870002) from either this bug, bug 866514 or bug 868406 (all of which need uplift except bug 868406).  The regression is windows-only, and refuses to produce a minidump; we're planning on taking a test machine offline to run the tests in a way that we can grab a dump.  The regression only affects use of this new feature, and it has not been seen yet except on the test servers.

Risk to taking this patch (and alternatives if risky): moderate - the risk is involved in resolving bug 870002 or deciding/finding-out that it's very hard to hit in practice (as opposed to tests).  We can also back out bug 866514 if it's found to be the cause, or back out all of these and bug 834835, though that would be a painful decision as it would push developers to use "back-door"/non-standard ways to figure out if a stream has video, or they'd write code that works in Chrome but not in Firefox.

String or IDL/UUID changes made by this patch: none
Attachment #744081 - Flags: approval-mozilla-aurora?
Attachment #744081 - Flags: approval-mozilla-aurora? → approval-mozilla-beta?
(In reply to Randell Jesup [:jesup] from comment #40)
> Comment on attachment 744081 [details] [diff] [review]
> fix
> 
> [Approval Request Comment]
> Bug caused by (feature/regressing bug #): 834835 (already uplifted)
> 
> User impact if declined: Crashes if GC runs after certain operations; or
> requiring backout of an important compatibility feature (bug 834835)
> 
> Testing completed (on m-c, etc.): On MC.  One regression (bug 870002) from
> either this bug, bug 866514 or bug 868406 (all of which need uplift except
> bug 868406).  The regression is windows-only, and refuses to produce a
> minidump; we're planning on taking a test machine offline to run the tests
> in a way that we can grab a dump.  The regression only affects use of this
> new feature, and it has not been seen yet except on the test servers.
> 
> Risk to taking this patch (and alternatives if risky): moderate - the risk
> is involved in resolving bug 870002 or deciding/finding-out that it's very
> hard to hit in practice (as opposed to tests).  We can also back out bug
> 866514 if it's found to be the cause, or back out all of these and bug
> 834835, though that would be a painful decision as it would push developers
> to use "back-door"/non-standard ways to figure out if a stream has video, or
> they'd write code that works in Chrome but not in Firefox.
> 
> String or IDL/UUID changes made by this patch: none

Comment 16 suggests we shouldn't take a forward fix. Can you address the fact that this is an edge case and shutdown crash only reproduced in mochitests, and probably not worth introducing bug 870002 (if this is the cause)?
No longer blocks: 870002
We've confirmed this is *not* the cause of bug 870002 btw. bug 866514 caused this.
FWIW, I don't think I've seen this crash on release builds.
Comment on attachment 744081 [details] [diff] [review]
fix

Since this isn't the cause, approving for Beta 22.
Attachment #744081 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
I suggest we simply don't fix this on beta in light of comment #43.
Sounds good to me.
Whiteboard: [WebRTC][blocking-webrtc+][webrtc-uplift][qa-] → [WebRTC][blocking-webrtc+][qa-]
Duplicate of this bug: 881745
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #47)
> I suggest we simply don't fix this on beta in light of comment #43.

In case it changes our course of action here, I'm getting this failure pretty frequently on the Bananabread demo, running Beta 22:

https://crash-stats.mozilla.com/report/index/bp-7a785880-b315-425a-ad7d-d87a92130625
https://crash-stats.mozilla.com/report/index/bp-f4ec5f70-ac71-4664-8ed2-db6752130625
Two notes on my comment above: (1) I was not trying to close firefox at the time, only leave the game arena; (2) Given that we just uplifted beta yesterday, I guess this bug is out in release now -- which means we have browser crashes in release for one of the cooler WebRTC demos.

Can we fix this kind of crash in release, or is bugfix in release reserved for chemspills?
Firefox 22.0 is about to get released today. So there is nothing we can do here anymore for 22.0. :( The frequency of such crashes is still low, but lets see how this will change once 22.0 is released.

https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3ADOMMediaStream%3A%3AStreamListener%3A%3ATrackChange%3A%3ARun%28%29
You need to log in before you can comment on or make changes to this bug.