Last Comment Bug 798323 - Peer connections are leaking a lot of threads and memory, which can not be released by GC
: Peer connections are leaking a lot of threads and memory, which can not be re...
Status: RESOLVED FIXED
[WebRTC][blocking-webrtc+][MemShrink:...
: crash, mlk, testcase
Product: Core
Classification: Components
Component: WebRTC (show other bugs)
: 18 Branch
: All All
: P1 critical (vote)
: mozilla20
Assigned To: Randell Jesup [:jesup]
: Jason Smith [:jsmith]
:
Mentors:
Depends on: 802538 807103
Blocks: 802982 812648
  Show dependency treegraph
 
Reported: 2012-10-05 06:17 PDT by Henrik Skupin (:whimboo)
Modified: 2012-12-30 06:25 PST (History)
18 users (show)
hskupin: in‑testsuite-
jsmith: in‑moztrap-
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
affected
unaffected
fixed


Attachments
testcase (2.76 KB, text/html)
2012-10-05 06:38 PDT, Henrik Skupin (:whimboo)
no flags Details
testcase (3.22 KB, text/html)
2012-10-15 01:34 PDT, Henrik Skupin (:whimboo)
no flags Details

Description Henrik Skupin (:whimboo) 2012-10-05 06:17:04 PDT
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:18.0) Gecko/18.0 Firefox/18.0

While working on the Mochitest for bug 796890 I have seen a crash on exit of the browser. Checking its details I have noticed a strange amount of running threads which were about 200 of them. I considered this as a massive leak and continued investigation.

To reproduce the thread leak you have to do the following steps:
1. Checkout the latest code from the alder tree
2. Apply the attachment 668290 [details] [diff] [review] from bug 795367 to add the Mochitest makefiles 
3. Apply the attachment 668414 [details] [diff] [review] of the video and audio peer connection WIP test
4. Run: TEST_PATH=dom/tests/mochitest/media/test_peerconnection_videoAudio.html make -C $(OBJ) mochitest-plain
5. Reload the page a couple of times or open new tabs with the test loaded
6. Close all the tabs
7. Shutdown Firefox

It doesn't matter how long you wait the aquired threads are never released. Each time you load the test the number of thread will be increased by about 15.

The crash information for my debug build is not that helpful so far:

Operating system: Mac OS X
                  10.7.5 11G56
CPU: amd64
     family 6 model 42 stepping 7
     4 CPUs

Crash reason:  EXC_BAD_ACCESS / KERN_INVALID_ADDRESS
Crash address: 0x0

Thread 15 (crashed)
 0  XUL + 0x267f168
    rbx = 0x0000000000001000   r12 = 0x0000000000000046
    r13 = 0x0000000000000046   r14 = 0x000000013d7dd000
    r15 = 0x00007fff7f2040a8   rip = 0x0000000103a4f168
    rsp = 0x000000013671aef0   rbp = 0x000000013671b150
    Found by: given as instruction pointer in context

I will try to get more details. Also I will make it a standalone testcase so that steps 1-3 are not necessary.

I consider this important enough to ask for blocking Firefox 18.0.
Comment 1 Maire Reavy [:mreavy] 2012-10-05 06:23:44 PDT
Adding Anant.  I don't think we should block all of WebRTC on this bug since WebRTC is preffed off and there is a great deal of value in getting WebRTC onto mozilla-central and into 18.  But we will want a fix for this ASAP (ideally before uplift), and I will strongly push for getting this uplifted into Aurora within the first few days if we can't fix it before uplift.
Comment 2 Henrik Skupin (:whimboo) 2012-10-05 06:38:24 PDT
Created attachment 668421 [details]
testcase

Simplified test which only has to be loaded once. It will auto-refresh each 5 seconds.
Comment 3 Henrik Skupin (:whimboo) 2012-10-05 06:39:36 PDT
I have to say that the crash only happens in my debug build so far. I can't the tinderbox build from Tuesday to crash.
Comment 4 Jason Smith [:jsmith] 2012-10-05 12:52:08 PDT
Smells like a blocker to me.
Comment 5 Anant Narayanan [:anant] 2012-10-05 13:42:58 PDT
I can't reproduce this. The example in this bug doesn't work - you're using .src instead of .mozSrcObject, and createFakeMediaStream is no longer a valid function, you have to use gUM({fake:true}).

On another test page, I see some leaks, but I don't see a shutdown crash. I'm investigating the leaks.

I don't think this should block landing as we're not going to flip the pref so unless someone specifically turns the functionality on they will not see the leak or the crash.
Comment 6 Anant Narayanan [:anant] 2012-10-05 13:43:52 PDT
I should explicitly mention that this should definitely block turning on the pref, which is a different question from getting PeerConnection landed into mozilla-inbound today.
Comment 7 Henrik Skupin (:whimboo) 2012-10-05 14:06:54 PDT
Oh, so my build was from Monday so probably before that change went in. That would explain why it doesn't work for you anymore. You will be able to reproduce when you use a tinderbox build from Monday.

I can update the test to make it work with the latest tinderbox build. But one question ahead, does it mean I can no longer create a peer connection without requesting the media via gUM before?
Comment 8 Anant Narayanan [:anant] 2012-10-05 14:09:11 PDT
You can definitely create a PeerConnection, but you can't call addStream until you've acquired a valid stream (real or fake) from gUM.
Comment 9 Henrik Skupin (:whimboo) 2012-10-06 06:21:06 PDT
(In reply to Anant Narayanan [:anant] from comment #8)
> You can definitely create a PeerConnection, but you can't call addStream
> until you've acquired a valid stream (real or fake) from gUM.

I'm still not sure why I need gUM here. As the latest source on the alder branch shows the createFakeMediaStream() method is still available for the PeerConnection object:

http://hg.mozilla.org/projects/alder/file/38146b56dbdd/dom/media/PeerConnection.js#l482

So I have tested my attached testcase with the latest tinderbox and own debug build and I can still reproduce this problem. Within a minute the threads are up to about 300 and cpu load >100%. On which platform are you testing on that you can't reproduce it? I talked with Erk on IRC and he was able to reproduce it. As he mentioned it's somewhat related to the event loop.
Comment 10 Anant Narayanan [:anant] 2012-10-06 09:58:38 PDT
createFakeMediaStream is on alder right now, but it won't be shortly. gUM is the correct way to obtain a fake stream.

Oh, I can definitely reproduce the thread problem, just not the shutdown crash.
Comment 11 Henrik Skupin (:whimboo) 2012-10-06 11:50:10 PDT
With the latest alder build from about 4h ago I get the following stack information from the Apple crash report:

Thread 5 Crashed:: Socket Thread
0   libsystem_kernel.dylib        	0x00007fff8e1a6ce2 __pthread_kill + 10
1   libsystem_c.dylib             	0x00007fff98dc87d2 pthread_kill + 95
2   libsystem_c.dylib             	0x00007fff98db9a7a abort + 143
3   libnspr4.dylib                	0x00000001005090fd PR_Assert + 109
4   libnss3.dylib                 	0x000000010061c1d2 PK11_GetInternalKeySlot + 98 (pk11slot.c:1754)
5   XUL                           	0x00000001042f13bc _ZN7mozillaL18nr_crypto_nss_hmacEPhiS0_iS0_ + 188 (nricectx.cpp:114)
6   XUL                           	0x00000001042d4bc1 nr_stun_compute_message_integrity + 209 (stun_codec.c:757)
7   XUL                           	0x00000001042d1eb8 nr_stun_attr_codec_message_integrity_encode + 88 (stun_codec.c:775)
8   XUL                           	0x00000001042d3180 nr_stun_encode_message + 1056 (stun_codec.c:1270)
9   XUL                           	0x00000001042ced89 nr_stun_client_send_request + 1577 (stun_client_ctx.c:358)
10  XUL                           	0x00000001042ce6e5 nr_stun_client_start + 149 (stun_client_ctx.c:112)
11  XUL                           	0x00000001042bdea6 nr_ice_candidate_pair_start + 230 (ice_candidate_pair.c:366)
12  XUL                           	0x00000001042c4df7 nr_ice_media_stream_check_timer_cb + 567 (ice_media_stream.c:333)
13  XUL                           	0x00000001042fb54a mozilla::nrappkitTimerCallback::Notify(nsITimer*) + 90 (nr_timer.cpp:53)
14  XUL                           	0x00000001036d9869 nsTimerImpl::Fire() + 1065 (nsTimerImpl.cpp:477)
15  XUL                           	0x00000001036d9bd6 nsTimerEvent::Run() + 230 (nsTimerImpl.cpp:558)
16  XUL                           	0x00000001036cf60d nsThread::ProcessNextEvent(bool, bool*) + 1373 (nsThread.cpp:613)
17  XUL                           	0x000000010363d13f NS_ProcessNextEvent_P(nsIThread*, bool) + 159 (nsThreadUtils.cpp:220)
18  XUL                           	0x000000010133fb1b nsSocketTransportService::Run() + 331 (nsSocketTransportService2.cpp:652)
19  XUL                           	0x00000001013404bc non-virtual thunk to nsSocketTransportService::Run() + 28
20  XUL                           	0x00000001036cf60d nsThread::ProcessNextEvent(bool, bool*) + 1373 (nsThread.cpp:613)
21  XUL                           	0x000000010363d13f NS_ProcessNextEvent_P(nsIThread*, bool) + 159 (nsThreadUtils.cpp:220)
22  XUL                           	0x00000001036ce1f7 nsThread::ThreadFunc(void*) + 295 (nsThread.cpp:256)
23  libnspr4.dylib                	0x0000000100538793 _pt_root + 355 (ptthread.c:159)
24  libsystem_c.dylib             	0x00007fff98dc68bf _pthread_start + 335
25  libsystem_c.dylib             	0x00007fff98dc9b75 thread_start + 13
Comment 12 Henrik Skupin (:whimboo) 2012-10-15 01:10:45 PDT
Can we get some progress on it? As it looks like it will block all of our mochitests. Some minutes ago I have noticed the leak too when doing the following steps:

1. Open http://people.mozilla.com/~anarayanan/webrtc/gum_test.html
2. Activate Video and stop it right away
3. Activate Audio

Some seconds later you will see a constant increase of the memory used. Even closing the tab and forcing a GC doesn't reduce the memory.
Comment 13 Henrik Skupin (:whimboo) 2012-10-15 01:34:09 PDT
Created attachment 671342 [details]
testcase

Updated testcase for the new gUM API.
Comment 14 Henrik Skupin (:whimboo) 2012-10-15 02:23:57 PDT
I can't reproduce the crash on shutdown anymore on OS X. If I can I will file it as a separate bug.
Comment 15 Peter Van der Beken [:peterv] 2012-10-15 08:39:18 PDT
I'm pretty sure we'll need to hook up a bunch of the WebRTC classes to cycle collection. As an example, PeerConnectionImpl holds a window which makes it trivial to cause leaks if it's not CC'ed.
Comment 16 Nicholas Nethercote [:njn] 2012-10-16 17:00:17 PDT
> I should explicitly mention that this should definitely block turning on the
> pref

+1
Comment 17 Henrik Skupin (:whimboo) 2012-10-27 21:20:59 PDT
Randell, has one of the latest patches fixed that issue? Can you directly point to one? If not I can test the latest nightly builds so that we can figure that out. Good to see such a progress here!
Comment 18 Henrik Skupin (:whimboo) 2012-10-29 06:30:47 PDT
Just to note, the dependent fix has not been landed for Firefox 18.0 yet, so Aurora is still affected.
Comment 19 Scoobidiver (away) 2012-10-29 06:46:44 PDT
(In reply to Henrik Skupin (:whimboo) from comment #18)
> Just to note, the dependent fix has not been landed for Firefox 18.0 yet
Can you add it to the Depends On field?
Comment 20 Henrik Skupin (:whimboo) 2012-10-29 07:24:57 PDT
I don't know of this other landed patch. Please see my comment 17. I hope Randell can help us.
Comment 21 Jason Smith [:jsmith] 2012-10-29 08:46:10 PDT
(In reply to Henrik Skupin (:whimboo) from comment #18)
> Just to note, the dependent fix has not been landed for Firefox 18.0 yet, so
> Aurora is still affected.

Hmm...if this unaffected on Nightly 19, does this imply we should be able to enable crash tests on mozilla central now?

Also - I think a larger decision (during the weekly meeting) needs to be made if we want to do a large spin of aurora uplifts or not to FF 18, cause there's a lot of them. I'm wondering if it's worth it vs. the overhead of maintaining two builds when there's a lot of stability issues we are still fighting through. I'll get more information on this.
Comment 22 Henrik Skupin (:whimboo) 2012-10-29 12:10:21 PDT
(In reply to Jason Smith [:jsmith] from comment #21)
> Hmm...if this unaffected on Nightly 19, does this imply we should be able to
> enable crash tests on mozilla central now?

I don't think so. There are probably some more leaks left which need to be fixed first. But you can try to run the crashtests on mozilla-central. If the leak check passes we would indeed be able to. That would be great!
Comment 23 Henrik Skupin (:whimboo) 2012-12-18 05:18:24 PST
So the threading issue is fixed in Firefox 20 now but we still have a slight increase of memory over time. It's not that much but consistently growing.
Comment 24 Henrik Skupin (:whimboo) 2012-12-30 05:13:27 PST
Can't see see it leaking anymore with the last dependency bug fixed. Good work!

I don't think that we can have a test for this specific issue. It takes too long and not sure how to measure memory and thread leakage. Instead a manual case would probably a good idea.
Comment 25 Jason Smith [:jsmith] 2012-12-30 06:25:41 PST
(In reply to Henrik Skupin (:whimboo) [away 12/21 - 01/01] from comment #24)
> Can't see see it leaking anymore with the last dependency bug fixed. Good
> work!
> 
> I don't think that we can have a test for this specific issue. It takes too
> long and not sure how to measure memory and thread leakage. Instead a manual
> case would probably a good idea.

Manual test cases defined typically don't test for leaks. We might do one-off investigations for leaks though.

In future test runs, I'll probably be running debug builds with leak checking turned on to mitigate for this. That will help mitigate catching issues such as this bug.

Note You need to log in before you can comment on or make changes to this bug.