Closed Bug 872839 Opened 11 years ago Closed 11 years ago

Intermittent 182, 188, 197, 221, 716, 763, 767 bytes leaked (Mutex, NrIceResolver, PeerConnectionMedia, ReentrantMonitor, nsSocketTransport, nsTArray_base, nsThread)

Categories

(Core :: WebRTC: Signaling, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla24
Tracking Status
firefox22 --- fixed
firefox23 --- fixed
firefox24 --- fixed

People

(Reporter: jib, Assigned: jib)

Details

(Keywords: intermittent-failure, memory-leak, Whiteboard: [WebRTC][blocking-webrtc-][qa-])

Attachments

(3 files, 4 obsolete files)

Attached file mozilla-inbound-leak.txt (obsolete) —
At times (over the last week at least), I get this leak fairly persistently when running the crashtests locally:

  ./mach crashtest dom/media/tests/crashtests

Meaning: I can run the test again and again and get the same leaking result. At other times they do not happen at all, and I've never seen them on the try/build servers, only locally.

I want to stress that I ran off of a pull from mozilla-inbound from earlier today, *without* any local patches added (see output), and I was able to verify the leak immediately on the first three runs (I didn't try more, since in my experience, once they happen, they happen persistently).

I'm filing this report preemptively to disprove any relation to patches I'm testing and am about to land.

See attachment (leaks at the bottom).
Attached file mozilla-inbound-leak.txt (obsolete) —
Just the crashtest output.
Attachment #750180 - Attachment is obsolete: true
Whiteboard: [WebRTC][blocking-webrtc-]
I've narrowed it down to 860143.html, which all it does is this:

<head>
  <meta charset="utf-8">
  <title>bug 860143</title>
  <script type="application/javascript">
    function start() {
      var o0 = new window.mozRTCPeerConnection({
        iceServers: [
          {
            url: "turn:AAAAAAAAAAAAAAAAA        AAAAAAAAAAA  AAAAAAAAAAAAA AA AAAAA        AAAAAAAAAAA  AAAAAAAAAAAAAAAA AAAAA        AAAAAAAAAA, AAAAAAAAAAAAAAAA        AAA    AAAAAAAAAAAAAAA"
          }
        ]
      });
      o0.close();
    }
  </script>
</head>

<body onload="start()">

Which makes sense as it causes asynchronous DNS resolution to start (NrIceResolver), with not a lot of time to finish.
Attachment #750182 - Attachment is obsolete: true
Actually, it is much simpler: The 257-character url gets past the validation in JS and into the parsing code in c++ where it fails in nricectx.h:

>    else if (addr.size() < 256) {

Nice one! From there it is a failure to cleanup from a rarely-exercised failure path in PeerConnectionImpl.cpp:

  mMedia = new PeerConnectionMedia(this);

  // Connect ICE slots.
  mMedia->SignalIceGatheringCompleted.connect(this,
    &PeerConnectionImpl::IceGatheringCompleted);
  mMedia->SignalIceCompleted.connect(this, &PeerConnectionImpl::IceCompleted);
  mMedia->SignalIceFailed.connect(this, &PeerConnectionImpl::IceFailed);

  // Initialize the media object.
  if (aRTCConfiguration) {
    IceConfiguration ic;
    res = ConvertRTCConfiguration(*aRTCConfiguration, &ic, aCx);
>   NS_ENSURE_SUCCESS(res, res); // <-- res = NS_ERROR_FAILURE.
    res = mMedia->Init(ic.getStunServers(), ic.getTurnServers());

Unfortunately, mMedia is created but not yet initialized, which is not good since cleanup of it happens to be rather complicated:

  // Forget the reference so that we can transfer it to
  // SelfDestruct().
  mMedia.forget().get()->SelfDestruct();

I'll reorder the code to avoid this, since there's really no need to create mMedia before parsing the url. Patch coming up.

Now why doesn't this leak show up on tinderbox? I see the test run, but no sign of the failure path being exercised. Are crashtest leaks disabled there?
Assignee: nobody → jib
Missed jib in IRC, but here's some more info I was going to say.

jsmith	jib: Dug into the issue you hit btw with Clint
jsmith	jib: Turns out you found a bug in buildbot
jsmith	jib: Or actually how we are using buildbot (aka our scripts)
jsmith	jib: We're apparently not running leak checks for crashtests on OS X 10.8
Good, though I don't think that explains not seeing this on tbpl, since it should have appeared for other OSes. Shutdown of the internal PeerConnectionMedia object is complicated, so it is possible there's a yet to be explored shutdown race here.
Reorders the code to parse the iceServers before creating the inner PeerConnectionMedia object, avoiding leaking on the subset of invalid iceServers that the JS validation code doesn't catch.
Attachment #752456 - Flags: review?(rjesup)
Actually fixes the leak rather than avoiding it.

Further analysis with Randell led us to the real culprit, which was that new PeerConnections weren't being added to the global list of PCs until after pc.initialize succeeded, which meant that if pc.initialize failed, then pc.close didn't get called by the shutdown observer.

Instead, pc.close got called during garbage collection, potentially much later in shutdown with the STS thread and the main thread possibly too far gone to succeed at completing the nsRunnable ping-pong dance in PeerConnectionMedia::SelfDestruct(). We believe this explains the intermittency.
Attachment #752456 - Attachment is obsolete: true
Attachment #752456 - Flags: review?(rjesup)
Attachment #752832 - Flags: review?(rjesup)
Cleaner code and failure paths is still a good idea, though no longer necessary to avoid the leak.
Attachment #752834 - Flags: review?(rjesup)
Attachment #752832 - Flags: review?(rjesup) → review+
Attachment #752834 - Flags: review?(rjesup) → review+
Added a comment. Carrying forward r+ from rjesup.
Attachment #752834 - Attachment is obsolete: true
Attachment #752861 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/f5303482b695
https://hg.mozilla.org/mozilla-central/rev/ec2e0fb43322
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla24
Whiteboard: [WebRTC][blocking-webrtc-] → [WebRTC][blocking-webrtc-][qa-]
Feel free to nominate for Aurora uplift btw.
Whiteboard: [WebRTC][blocking-webrtc-][qa-] → [WebRTC][blocking-webrtc-][qa-][webrtc-uplift]
Comment on attachment 752832 [details] [diff] [review]
Part 1: Puts new PeerConnections on global list earlier, ensuring close is called should PC.initialize fail.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): N/A

User impact if declined: Occasional leaks when PeerConnection initialization fails (generally would be due to misconfiguration)/

Testing completed (on m-c, etc.): on m-c for a while, verified

Risk to taking this patch (and alternatives if risky): very low; the patch  makes sure it's on the list of active connections before doing anything that can fail

String or IDL/UUID changes made by this patch: none
Attachment #752832 - Flags: approval-mozilla-aurora?
Just requesting uplift for part 1, as part two is just code cleanup
Attachment #752832 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Whiteboard: [WebRTC][blocking-webrtc-][qa-][webrtc-uplift] → [WebRTC][blocking-webrtc-][qa-]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: