Closed Bug 1182515 Opened 9 years ago Closed 9 years ago

Intermittent leakcheck | default process: 11322 bytes leaked (CondVar, Dashboard, DataStorage, EventTokenBucket, HttpBaseChannel, ...)(BackstagePass, CondVar, Dashboard, DataStorage, EventTokenBucket, ...)

Categories

(Core :: Networking: WebSockets, defect)

40 Branch
x86_64
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla43
Tracking Status
firefox41 --- fixed
firefox42 --- fixed
firefox43 --- fixed
firefox-esr38 --- unaffected

People

(Reporter: RyanVM, Assigned: michal)

References

Details

(Keywords: intermittent-failure, memory-leak, Whiteboard: [Analysis in comment 60])

Attachments

(2 files, 1 obsolete file)

05:20:06 INFO - Leaked URLs:
05:20:06 INFO - ws://localhost/
05:20:06 INFO - https://localhost/
05:20:06 INFO - nsStringStats
05:20:06 INFO - => mAllocCount: 559074
05:20:06 INFO - => mReallocCount: 36939
05:20:06 INFO - => mFreeCount: 559053 -- LEAKED 21 !!!
05:20:06 INFO - => mShareCount: 1225579
05:20:06 INFO - => mAdoptCount: 26685
05:20:06 INFO - => mAdoptFreeCount: 26683 -- LEAKED 2 !!!
05:20:06 INFO - => Process ID: 1307, Thread ID: 140735090425024

05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 CaptivePortalService (120 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 CondVar (40 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 Dashboard (88 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 DataStorage (328 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 2 EventTokenBucket (352 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 HttpBaseChannel (1056 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 LoadInfo (88 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 10 Mutex (320 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 4 ReentrantMonitor (160 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 Service (144 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 2 StringAdopt (2 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 ThirdPartyUtil (32 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 WebSocketChannel (1992 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsAuthURLParser (24 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsCookiePermission (64 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsCookieService (152 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsDNSService (168 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 4 nsDeque (384 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsEffectiveTLDService (88 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 2 nsHttpAuthCache::AppDataClearObserver (64 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsHttpChannel (1416 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsHttpConnectionMgr (312 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsHttpHandler (704 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsHttpRequestHead (88 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsIDNService (120 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsIOService (288 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsJSPrincipals (16 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsLocalFile (192 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 4 nsMainThreadPtrHolder<T> (96 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsNetworkLinkService (72 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsObserverService (88 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsPermissionManager (184 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsPrefBranch (128 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsRandomGenerator (40 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsSiteSecurityService (64 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsSocketTransportService (272 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 2 nsStandardURL (496 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 21 nsStringBuffer (168 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 11 nsTArray_base (88 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 2 nsThread (480 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsTimerImpl (96 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsUnicodeNormalizer (24 bytes)
05:20:06 INFO - TEST-INFO | leakcheck | default process: leaked 7 nsWeakReference (224 bytes)
05:20:06 WARNING - TEST-UNEXPECTED-FAIL | leakcheck | default process: 11322 bytes leaked (CaptivePortalService, CondVar, Dashboard, DataStorage, EventTokenBucket, ...)
Flags: needinfo?(valentin.gosu)
I don't think this is a new leak, just that the captive portal happens to be first in the list. Clearing the ref when observing the shutdown notification should remove the captive portal service from the list.
Attachment #8632780 - Flags: review?(mcmanus)
Assignee: nobody → valentin.gosu
Status: NEW → ASSIGNED
Flags: needinfo?(valentin.gosu)
Comment on attachment 8632780 [details] [diff] [review]
clear reference to captive portal service when observing xpcom-shutdown

Review of attachment 8632780 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/nsIOService.cpp
@@ +1403,5 @@
>  
>          if (mCaptivePortalService) {
>              static_cast<CaptivePortalService*>(mCaptivePortalService.get())->Stop();
>          }
> +        mCaptivePortalService = nullptr;

might as well move the null assignment into the if (!null) block as long as we have already done the check.
Attachment #8632780 - Flags: review?(mcmanus) → review+
Attachment #8632780 - Attachment is obsolete: true
Looks like the patch just altered the leak signature. Also, bug 1128662 looks possibly-related (similar signature and platform). Though this is obviously way more frequent.

I'm also trying to narrow this down on inbound to get a better regression range.
See Also: → 1128662
Summary: Intermittent leakcheck | default process: 11322 bytes leaked (CaptivePortalService, CondVar, Dashboard, DataStorage, EventTokenBucket, ...) → Intermittent leakcheck | default process: 11322 bytes leaked (CondVar, Dashboard, DataStorage, EventTokenBucket, HttpBaseChannel, ...)(BackstagePass, CondVar, Dashboard, DataStorage, EventTokenBucket, ...)
Ehsan, the pile o' retriggers below is pointing at bug 1180448 as the regressor here. Can you please take a look?
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=os%20x%2010.6%20debug%20mochitest-browser-chrome&tochange=272a2cde6bd1&fromchange=fa77180112f7
Flags: needinfo?(ehsan)
Keywords: leave-open
Yes, will do!
The most interesting observation here is that the thing that is leaking is the observer service.  All of the other things are either nsIObservers, or are held alive by another one of these objects that is an nsIObserver.

This kind of suggests that my uninit() method may not be running, for some reason.  I'm testing some hypotheses on the try server, but in the mean time, Mike, have you ever seen something like this?
Flags: needinfo?(mconley)
Depends on: 1184395
Depends on: 1184397
Progress report:

I have three patches on the try server trying various different things:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=307fd412343e
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9d63a733f0f2
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2a9aed771458

So far, out of 500+ runs of mochitest-bc1 on OSX 10.6 (that number is _not_ an exaggeration!!) I have managed to reproduce this exactly _zero_ times.  That is not encouraging.  But furthermore, it seems like this happens in other mochitest-bc chunks too, and on other platforms too.  I have re-triggered bc2/3 runs on all of those try pushes but I am losing faith in ever being able to reproduce this on the try server.

Looking at the objects that have been leaking more closely, nsCookieService and nsPermissionManager also hold the observer service alive, which makes it difficult to guess which object is the root of the leaked subgraph.  I have posted fixes for those in the dependencies of this bug, and once they land, we should see how the signature of these leaks change.  If one of these objects is the root of the leaked graph, the observer service and many of the existing leaked observers should disappear from the list once those patches land!
Note to self: there is an interesting leaked URL: "ws://localhost/" in <https://treeherder.mozilla.org/logviewer.html#?job_id=11776247&repo=mozilla-inbound> which seems to be used by only one test in the tree: http://mxr.mozilla.org/mozilla-central/source/browser/components/loop/test/mochitest/browser_loop_fxa_server.js

Not sure how relevant this is yet...
This run <http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/eakhgari@mozilla.com-9d63a733f0f2/try-macosx64-debug/try_snowleopard-debug_test-mochitest-browser-chrome-2-bm108-tests1-macosx-build2197.txt.gz> from <https://treeherder.mozilla.org/#/jobs?repo=try&revision=9d63a733f0f2> shows that we call my init() and uninit() function an equal number of times successfully when the leak happens, which almost definitely clears the name of bug 1180448 here, but this bug is now bothering me, so I would like to dig in some more.

Joel, you have some magical tool that can retrigger test runs on inbound/etc in order to find what introduced an orange.  Can you please take that tool and run it against this bug?  Note that you probably want to trigger all mochitest-bc runs, not just bc1, as I was mistakenly doing before...

Thanks!
Flags: needinfo?(jmaher)
(In reply to Ehsan Akhgari (not reviewing patches, not reading bugmail, needinfo? me!) from comment #45)
> This run
> <http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/eakhgari@mozilla.
> com-9d63a733f0f2/try-macosx64-debug/try_snowleopard-debug_test-mochitest-
> browser-chrome-2-bm108-tests1-macosx-build2197.txt.gz> from
> <https://treeherder.mozilla.org/#/jobs?repo=try&revision=9d63a733f0f2> shows
> that we call my init() and uninit() function an equal number of times
> successfully when the leak happens, which almost definitely clears the name
> of bug 1180448 here, but this bug is now bothering me, so I would like to
> dig in some more.
> 

Can confirm.
Flags: needinfo?(mconley)
Depends on: 1185515
Depends on: 1185986
Depends on: 1185988
OK, I finally figured out what is happening here.

I noticed that the test log includes only two leaked URLs, one of which is ws://localhost.  Some quick investigation showed that the one of the few browser-chrome tests in the tree that is using this URL is browser/components/uitour/test/browser_UITour_loop.js.  Looking at the log of one of these intermittent failures, you may notice the following:

16:33:38     INFO -  console.info: Loop: PushHandler: attempt to open websocket to PushServer:  ws://localhost/
16:33:38     INFO -  [1172] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /builds/slave/try-m64-d-00000000000000000000/build/src/netwerk/protocol/websocket/WebSocketChannel.cpp, line 2545
16:33:38     INFO -  JavaScript error: resource:///modules/loop/MozLoopPushHandler.jsm, line 73: NS_ERROR_NOT_AVAILABLE: Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIWebSocketChannel.asyncOpen]

And, of the leaked objects is WebSocketChannel.  The code in question here is <https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/websocket/WebSocketChannel.cpp#2542>.  There, the call to GenerateRandomBytes is failing which comes from here <https://dxr.mozilla.org/mozilla-central/source/security/manager/ssl/nsRandomGenerator.cpp#25> indicating that the object has been shut down.  To test my theory, I pushed a patch to the try server to make this intermittent failure of GenerateRandomBytes a persistent one, and as you can see here <https://treeherder.mozilla.org/#/jobs?repo=try&revision=a88ce3d5d8f3>, that turns this intermittent failure into a nice 100% reproducible one on OSX bc1.  I would expect further debugging as to why we end up leaking the WebSocket channel should be straightforward for someone who knows that code.

As you can notice in the try link above, bc2 and bc3 test failures did not reproduce with this injected fault.  I expect that we end up leaking the WebSocket channel object in a variety of runtime error conditions, and my guess is that we're hitting other instances of runtime failures on bc2 and bc3, but I have not investigated further.

Patrick, do you mind forwarding this to the right people in the Necko team, please?  Thanks!
Assignee: valentin.gosu → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(ehsan) → needinfo?(mcmanus)
Component: Networking → Networking: WebSockets
Whiteboard: [Analysis in comment 60]
michal can you work this out? Ehsan has put a lot of work into chasing this leak...
Flags: needinfo?(mcmanus) → needinfo?(michal.novotny)
Assignee: nobody → michal.novotny
Flags: needinfo?(michal.novotny)
Attached patch fixSplinter Review
There is a reference cycle between WebSocketChannel and nsHttpChannel. It is broken in WebSocketChannel::StopSession or WebSocketChannel::OnStopRequest, but none of them is called when WebSocketChannel::AsyncOpen fails.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=091a6e8c7cb6
Attachment #8649460 - Flags: review?(mcmanus)
Comment on attachment 8649460 [details] [diff] [review]
fix

Review of attachment 8649460 [details] [diff] [review]:
-----------------------------------------------------------------

thanks!

::: netwerk/protocol/websocket/WebSocketChannel.cpp
@@ +3279,5 @@
> +        mWebSocketChannel->mHttpChannel = nullptr;
> +      }
> +    }
> +
> +    WebSocketChannel* mWebSocketChannel;

still using type *var around here..
Attachment #8649460 - Flags: review?(mcmanus) → review+
Can we please get this landed?
Flags: needinfo?(michal.novotny)
Keywords: leave-open
Flags: needinfo?(michal.novotny)
https://hg.mozilla.org/mozilla-central/rev/8f626f6be974
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla43
Thanks for getting to the bottom of this! Can you please nominate this patch for Aurora/Beta approval too? :)
Flags: needinfo?(michal.novotny)
Comment on attachment 8649460 [details] [diff] [review]
fix

Approval Request Comment
[Feature/regressing bug #]: unknown
[User impact if declined]: memory leak
[Describe test coverage new/current, TreeHerder]: should be covered by existing mochitests
[Risks and why]: low, this is a small change that just cleans up member variables on failure
[String/UUID change made/needed]: none
Flags: needinfo?(michal.novotny)
Attachment #8649460 - Flags: approval-mozilla-beta?
Attachment #8649460 - Flags: approval-mozilla-aurora?
Comment on attachment 8649460 [details] [diff] [review]
fix

This is an intermittent failure and a mem leak fix. The patch seems safe enough to uplift to Aurora42 and Beta41.
Attachment #8649460 - Flags: approval-mozilla-beta?
Attachment #8649460 - Flags: approval-mozilla-beta+
Attachment #8649460 - Flags: approval-mozilla-aurora?
Attachment #8649460 - Flags: approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: