Closed Bug 1325215 Opened 3 years ago Closed 3 years ago

Linux32 crashtests perma leaking the world

Categories

(Core :: Web Audio, defect, P1)

52 Branch
Unspecified
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla53
Tracking Status
firefox50 --- unaffected
firefox51 --- unaffected
firefox52 + fixed
firefox53 + fixed

People

(Reporter: ahal, Assigned: padenot)

References

Details

(Keywords: memory-leak, regression, Whiteboard: [MemShrink:P1])

Attachments

(2 files, 2 obsolete files)

https://archive.mozilla.org/pub/firefox/try-builds/ahalberstadt@mozilla.com-46239c8ae40ed066d6ed784a1d91a0811c71f7df/try-linux-debug/try_ubuntu32_vm-debug_test-crashtest-bm04-tests1-linux32-build295.txt.gz

A bug was recently found for suites using both structured logs and mozleak where leaks failed to turn jobs orange. In the meantime, a leak slipped in causing linux crashtests to leak the world (see above log). This is currently permafail on both central and aurora (but not yet beta).

The fix to the test infrastructure is landing in bug 1325148. As part of that bug, I will be temporarily disabling leak checks for crashtest. This will allow us to land the fix sooner and prevent further leaks from slipping past reftest and jsreftest.

Once this bug is fixed, leak checking in crashtests can be turned back on.
OS: Unspecified → Linux
This also exists on aurora, it is not on beta.
Version: unspecified → 52 Branch
Crap, sorry I messed up the bisection. I didn't realize this leak was only happening on linux32 (not linux64), and the linux32 crashtests need to be backfilled. So you guys are likely off the hook, and I'll post again when the backfill has finished running.
Flags: needinfo?(sunfish)
Flags: needinfo?(luke)
The real culprit is:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=ace491cb598e8767a76e817f111e534fda405518&filter-searchStr=linux%20debug%20crashtest

Alex, sorry this went uncaught for so long. The test harnesses have been failing to report leaks for a long time (see my firefox-dev post or bug 1325148). But I believe one of your patches in this push introduced this memory leak (see linked log in comment 0).

If you'd like to see the failure on a try push, simply backout https://hg.mozilla.org/mozilla-central/rev/90dd139d7578 alongside your push. Please let me know if you need more information.
Flags: needinfo?(achronop)
Summary: Linux crashtests perma leaking the world → Linux32 crashtests perma leaking the world
Blocks: 1286041
[Tracking Requested - why for this release]: Leaks are bad.
Hmm.  Looking at AudioContext, it's got all sorts of strong-ref members it's not cycle-collecting (e.g. mDecodeJobs, mPromiseGripArray, mBasicWaveFormCache).

The early return added in the push above in OnStateChanged may be preventing us from reaching the mPromiseGripArray.RemoveElement() call, and then the lack of cycle collection could be causing us to leak...

Is there a reason those various things are not cycle collected?
Flags: needinfo?(ehsan)
(In reply to Boris Zbarsky [:bz] (still a bit busy) from comment #7)
> Hmm.  Looking at AudioContext, it's got all sorts of strong-ref members it's
> not cycle-collecting (e.g. mDecodeJobs, mPromiseGripArray,
> mBasicWaveFormCache).

mBasicWaveFormCache holds an object that cannot participate in cycles, but the other two are definitely bugs.

> The early return added in the push above in OnStateChanged may be preventing
> us from reaching the mPromiseGripArray.RemoveElement() call, and then the
> lack of cycle collection could be causing us to leak...

Given that push, this is indeed the most plausible reason.  I'll write a patch.
Assignee: nobody → ehsan
Component: General → Web Audio
Flags: needinfo?(ehsan)
Andrew, what's the process for testing a patch for one of these bugs?
Flags: needinfo?(ahalberstadt)
Probably back out https://hg.mozilla.org/mozilla-central/rev/90dd139d7578 and see if things fail with/without the patch?
Comment on attachment 8822500 [details] [diff] [review]
Ensure that all AudioContext members that need to participate in cycle collection do so

r=me but I have low confidence that I'd catch issues like someone on another thread having a ref to something in mPromiseGripArray (its ostensible job is to enable that, right?) and us pulling the rug out from under it.  At the very least, it's worth a comment for mPromiseGripArray explaining what the ownership model is and why clearing it is safe....
Attachment #8822500 - Flags: review?(bzbarsky) → review+
Comment on attachment 8822500 [details] [diff] [review]
Ensure that all AudioContext members that need to participate in cycle collection do so

When we transfer the promise to another thread, we cast it to void* and then cast it back to Promise* when we receive it again on the main thread.  But I'll ask another review from karlt.
Attachment #8822500 - Flags: review?(karlt)
Yes, backout https://hg.mozilla.org/mozilla-central/rev/90dd139d7578 to test it. This is also the only leak I've seen in reftests, so you can also push the backout along with this patch (or just land this and I'll do it later).
Flags: needinfo?(ahalberstadt)
(In reply to :Ehsan Akhgari from comment #14)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=608191d559885134ede4dfdb8fbe082b2f481128

The patch fixes the leak! \o/
Comment on attachment 8822500 [details] [diff] [review]
Ensure that all AudioContext members that need to participate in cycle collection do so

AsyncDecodeWebAudio creates a MediaDecodeTask with a unaccounted/bare reference
to a WebAudioDecodeJob in mDecodeJobs.  The MediaDecodeTask is dispatched to
another thread and then passed back to the main thread.
I don't see anything else keeping the WebAudioDecodeJob alive and so AFAICS
the reference in mDecodeJobs cannot be cleared by CC.

Similarly, allowing the AudioContext and WebAudioDecodeJob to be CCed could
mean the callback and promise are not called.  Is there a reference I'm
missing?

The MediaDecodeTask should complete in finite time and call
WebAudioDecodeJob::OnSuccess/Failure(), which will clear the reference in
mDecodeJobs.  That means there should be no need to traverse this link, but I
don't know why WebAudioDecodeJob is a CCed class and I don't know what stops
the script running while the page is in BF cache.

I've run out of time to look at mPromiseGripArray, sorry.  Does it have similar
issues?  Feel free to ask padenot to review that as I won't be back for a few
days and I'm not familiar with this code.  Similarly, if you can convince
someone else that there is another untraversed reference to the
WebAudioDecodeJob somewhere.
Attachment #8822500 - Flags: review?(karlt) → review-
Flags: needinfo?(achronop)
Tracking 52/53+ since this is a perma leak.
(In reply to Karl Tomlinson (:karlt) from comment #17)
> Comment on attachment 8822500 [details] [diff] [review]
> Ensure that all AudioContext members that need to participate in cycle
> collection do so
> 
> AsyncDecodeWebAudio creates a MediaDecodeTask with a unaccounted/bare
> reference
> to a WebAudioDecodeJob in mDecodeJobs.  The MediaDecodeTask is dispatched to
> another thread and then passed back to the main thread.
> I don't see anything else keeping the WebAudioDecodeJob alive and so AFAICS
> the reference in mDecodeJobs cannot be cleared by CC.
> 
> Similarly, allowing the AudioContext and WebAudioDecodeJob to be CCed could
> mean the callback and promise are not called.  Is there a reference I'm
> missing?

No, you're completely right.

> The MediaDecodeTask should complete in finite time and call
> WebAudioDecodeJob::OnSuccess/Failure(), which will clear the reference in
> mDecodeJobs.  That means there should be no need to traverse this link, but I
> don't know why WebAudioDecodeJob is a CCed class and I don't know what stops
> the script running while the page is in BF cache.

The reason why this class is CCed is that it used to have a |JS::Heap<JSObject*>| member (see bug 891986) but that seems unnecessary now.  I think we should probably consider removing the CC stuff and just hold on to it using a UniquePtr.

> I've run out of time to look at mPromiseGripArray, sorry.  Does it have
> similar
> issues?  Feel free to ask padenot to review that as I won't be back for a few
> days and I'm not familiar with this code.  Similarly, if you can convince
> someone else that there is another untraversed reference to the
> WebAudioDecodeJob somewhere.

For mPromiseGripArray, we use this array to keep the promises alive, but that's not crucial for our observable behavior, that is, it will only be observable that a promise is resolved or rejected if the consumer is already holding a reference to the promise and therefore keeping it alive.  Otherwise it's find for us to not resolve the promise and nobody will notice.  We should probably detect and handle that case correctly in AudioContext::OnStateChanged though.
Attachment #8822500 - Attachment is obsolete: true
(In reply to :Ehsan Akhgari from comment #19)
> > The MediaDecodeTask should complete in finite time and call
> > WebAudioDecodeJob::OnSuccess/Failure(), which will clear the reference in
> > mDecodeJobs.  That means there should be no need to traverse this link, but I
> > don't know why WebAudioDecodeJob is a CCed class and I don't know what stops
> > the script running while the page is in BF cache.
> 
> The reason why this class is CCed is that it used to have a
> |JS::Heap<JSObject*>| member (see bug 891986) but that seems unnecessary
> now.  I think we should probably consider removing the CC stuff and just
> hold on to it using a UniquePtr.

I did that in bug 1328422.
> that is, it will only be observable that a promise is resolved or rejected
> if the consumer is already holding a reference to the promise and therefore keeping it alive.

That is not true at all.  Say the consumer does:

  yourObject.getPromise().then(f1, f2);

The consumer is not holding a reference to the promise returned by getPromise.  But it _can_ tell whether the promise is resolved or rejected by which of f1 and f2 is called.  The thing keeping the promise alive is whatever is planning to resolve or reject it.

> Otherwise it's find for us to not resolve the promise and nobody will notice

The people whose callbacks don't get called will notice!
Flags: needinfo?(ehsan)
Rank: 21
Priority: -- → P2
This can only happen in a window that has FreeInnerObjects called on it, no?  Are such callbacks expected to be called in that situation?
Flags: needinfo?(ehsan) → needinfo?(bzbarsky)
> This can only happen in a window that has FreeInnerObjects called on it, no? 

Which "this"?

> Are such callbacks expected to be called in that situation?

Unclear.  Per spec, yes, but the spec is bonkers.  In Gecko, we avoid calling callbacks for a promise whose global IsDying(), but call them otherwise, I believe.
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bz] (still a bit busy) from comment #24)
> > This can only happen in a window that has FreeInnerObjects called on it, no? 
> 
> Which "this"?

The callbacks not getting called.

> > Are such callbacks expected to be called in that situation?
> 
> Unclear.  Per spec, yes, but the spec is bonkers.  In Gecko, we avoid
> calling callbacks for a promise whose global IsDying(), but call them
> otherwise, I believe.

Hmm, so it's _possible_ for this to happen after FreeInnerObjects.  :(

I don't have a good idea on what to do any more then.
Assignee: ehsan → nobody
Note that putting promises in mPromiseGripArray isn't the only way to keep them alive, but still the problem of us having a reference that creates a cycle but we cannot declare to the cycle collector remains, even if we resort to manual AddRefs/Releases...
> The callbacks not getting called.

Ah, because that's the only way to get into the "mAudioContextState == AudioContextState::Closed" state?  

Would it make sense to just reject all the promises when entering that state, and block more promises being added to the list?
More to the point, shouldn't the web audio spec define the behavior for what happens with these promises on navigation?
Yes, I think this is definitely underdefined right now.
Whiteboard: [MemShrink]
(In reply to Boris Zbarsky [:bz] (still a bit busy) from comment #27)
> > The callbacks not getting called.
> 
> Ah, because that's the only way to get into the "mAudioContextState ==
> AudioContextState::Closed" state?  

You can enter "AudioContextState::Closed" when calling `close` on the `AudioContext`. This is also called from FreeInnerObject (FreeInnerObject -> AudioContext::Shutdown -> AudioContext::Close), but the promise returned is ignored.

> Would it make sense to just reject all the promises when entering that
> state, and block more promises being added to the list?

We already do the second part: per spec, calling `close` on an `AudioContext` set a main thread flag on the `AudioContext` so that authors cannot call `suspend`, `resume` or `close` again on it (it throws). This blocks more promise from being added to the list.

Then in theory, all requests and their associated promises are being processed linearly on the other thread, until processing the `close`, which is the last one in the queue, since no other `Promise` can be queue after (apart from an implementation detail in bug 1285290). When a request has been processed on the other thread, a runnable is posted to the main thread, we find the promise associated with it, and resolve it.

However in this context (Calling `close` on an `AudioContext` that has other promises in fligh), it is probably reasonable to reject all  promises that are in flight (we would store them on the AudioContext) and proceed with closing the context, resolving the `Promise` returned by the `close` method last.

This would require a spec change but not too crazy, and is what Chromium is doing apparently. I don't quite know for sure what we would do with the pointers we sent to the other thread, we can probably check if there are still registered as "in-flight promise", and if not, do nothing. We can also send something else that are not pointers if we want extra safety (probably using CoatCheck [0]).

Now, if we are navigating away from the page, what usually happens with `Promise`s that are in flight? Do usually we drop them? There is no time to execute the `then()` functions, right ? Maybe we can just spec that navigating away from a page that has active (i.e. non already closed) `AudioContext` close them, but drops the promise instead of rejecting them. Is that legal?

Is there a precedent on the web platform to write prose like this? I tried to have a look at fetch (that I know is promise heavy), but I haven't managed to find a behaviour I could use as an inspiration, but I must be looking in the wrong location.


[0]: https://dxr.mozilla.org/mozilla-central/source/dom/media/systemservices/MediaUtils.h?q=CoatCheck&redirect_type=direct#208
Flags: needinfo?(ehsan)
Flags: needinfo?(bzbarsky)
> what usually happens with `Promise`s that are in flight?

Unclear.

> There is no time to execute the `then()` functions, right ?

Unclear.  Per spec, the interaction of promise processing with navigation is somewhere between "batshit crazy" (i.e. everything is alive forever, navigation can't clean anything up) and "not defined", depending on how you choose to look at it...

> Is there a precedent on the web platform to write prose like this?

I'm not sure.  Paging Domenic, who's been more involved in promisy bits...
Flags: needinfo?(bzbarsky) → needinfo?(d)
I'm wondering if we should consider backing out bug 1286041 from Aurora at this point. I checked and it nearly does so cleanly, so hopefully Alex could cook that up with minimal effort. This bug doesn't seem to be moving in a "We're going to have a safe to uplift fix" direction :(
Flags: needinfo?(achronop)
(In reply to Paul Adenot (:padenot) from comment #30)
> > Would it make sense to just reject all the promises when entering that
> > state, and block more promises being added to the list?
> 
> We already do the second part: per spec, calling `close` on an
> `AudioContext` set a main thread flag on the `AudioContext` so that authors
> cannot call `suspend`, `resume` or `close` again on it (it throws). This
> blocks more promise from being added to the list.
> 
> Then in theory, all requests and their associated promises are being
> processed linearly on the other thread, until processing the `close`, which
> is the last one in the queue, since no other `Promise` can be queue after
> (apart from an implementation detail in bug 1285290). When a request has
> been processed on the other thread, a runnable is posted to the main thread,
> we find the promise associated with it, and resolve it.

That sounds reasonable to me, and that is not the cause of this bug IIUC.

> However in this context (Calling `close` on an `AudioContext` that has other
> promises in fligh), it is probably reasonable to reject all  promises that
> are in flight (we would store them on the AudioContext) and proceed with
> closing the context, resolving the `Promise` returned by the `close` method
> last.

I prefer the model of resolving promises in a way that depends only on the
state of the AudioContext at the time the promise is created, rather than
having the resolution depend on subsequent async operations, leading to races.

> Now, if we are navigating away from the page, what usually happens with
> `Promise`s that are in flight? Do usually we drop them? There is no time to
> execute the `then()` functions, right ? Maybe we can just spec that
> navigating away from a page that has active (i.e. non already closed)
> `AudioContext` close them, but drops the promise instead of rejecting them.
> Is that legal?

That sound fine to me, if others think it is reasonable.
I was summoned to talk about promises and navigation, so here goes:

In general, the platform allows code from navigated-away-from frames to run, as long as the event loop still exists. (For example, you can save a closure from an iframe, then navigate that iframe, and the closure will still work when called.) Conceptually, that's how promises work too. A closure is saved to the event loop, and can still be executed even after navigation. Only tearing down the entire event loop will stop it from executing. So, you can definitely execute then() callbacks after navigation; there's no issue there.

For specific cases, the platform will prevent certain closures from executing on navigation. This is often done in the "unloading document cleanup steps". HTML itself uses these to clear the "list of active timers", thus stopping any setTimeout-enqueued closures from executing after navigation.

Another example is Fetch. Although it is currently buggy (per https://github.com/whatwg/fetch/issues/416) and I believe underdefined (in that nothing on the platform actually uses https://fetch.spec.whatwg.org/#concept-fetch-group-terminate from what I can tell), the intent is that navigating will reject all in-flight promises. This will immediately run their rejection handlers. (Although it's possible that they might execute after the actual document destruction due to event loop vagaries; I'm not sure.)

So in general, you first need to ask whether you actually want to prevent these promises from settling after navigation or not. If an AudioContext is usable even in a navigated-away-from document in general, then there's no reason to prematurely cease any ongoing promise activity. I don't see "active document" anywhere Ctrl+Fing through https://webaudio.github.io/web-audio-api/, or any "unloading document cleanup steps" that would close an AudioContext when the documents unloads, so maybe this is the case you've landed in here.

If you *do* want to prevent web audio from working after navigation, then you'll probably want to insert a bunch of "active document" checks throughout the web audio spec, and then add "unloading document cleanup steps" that reject any in-flight promises and perform any necessary shutdown.

Hope this helps...
Flags: needinfo?(d)
To avoid spamming this thread, I will follow up Ryan's request in the original bug (1286041).
Flags: needinfo?(achronop)
P1, this has disabled leak checking on crash tests. Is there a reason we're only backing out on Aurora?
Flags: needinfo?(ryanvm)
Whiteboard: [MemShrink] → [MemShrink:P1]
I wanted to leave the door open for a "real" fix based on the work already done in this bug once the spec-related questions get sorted out. And the leak checking disabling is only on Linux, so we're not completely devoid of coverage. That said, I'm not opposed to bug 1330372 landing on trunk too to stop the bleeding and repurposing this bug for a "real" fix if that's what the media folks prefer.
Flags: needinfo?(ryanvm)
Paul - please see comment 34.  And what's your opinion on backing out, or leaving it disabled (on linux)?

In any case, you know more about this than I.
Flags: needinfo?(padenot)
(In reply to Karl Tomlinson (:karlt) from comment #33)

> I prefer the model of resolving promises in a way that depends only on the
> state of the AudioContext at the time the promise is created, rather than
> having the resolution depend on subsequent async operations, leading to
> races.

I agree. Let's continue doing that, then.

> > Now, if we are navigating away from the page, what usually happens with
> > `Promise`s that are in flight? Do usually we drop them? There is no time to
> > execute the `then()` functions, right ? Maybe we can just spec that
> > navigating away from a page that has active (i.e. non already closed)
> > `AudioContext` close them, but drops the promise instead of rejecting them.
> > Is that legal?
> 
> That sound fine to me, if others think it is reasonable.

I think we will settle on rejecting the promise on navigation.
Flags: needinfo?(padenot)
Thanks all, this is much clearer now.

(In reply to Domenic Denicola from comment #34)
> If you *do* want to prevent web audio from working after navigation, then
> you'll probably want to insert a bunch of "active document" checks
> throughout the web audio spec, and then add "unloading document cleanup
> steps" that reject any in-flight promises and perform any necessary shutdown.

The behaviour here is that if we're navigating away from a document, all sound should be stopped and we should stop everything as far as web audio is concerned. I've filed an issue [0] against the spec, and I've assigned myself.

Now, the plan of action to solve this is as follow:
1. Wait for a rough consensus from the WG (we'll probably settle on rejecting the promises when navigating away from the doc, so we can start doing step 2 now).
2. Write a gecko patch that rejects the promises when the AudioContext is being shutdown. It will do the following:
  1. Reject all the promises when the document is unloaded.
  2. Whenever we want to resolve a promise on the main thread because we've just finished an operation on the audio thread, don't panic if we can't find it in the `mPromiseGripArray`, since it can be that it has already been rejected.
  3. Land something along the lines of ehsan's patch (since I think step 2.2 will bitrot it) and restore checking for leaks
  4. Clean up all this, using `CoatCheck` [1] instead of our `void*` hack so it's less dangerous, and add comments about all this more.

I'll probably do 2.4 in a separate bug so we can close this bug and restore leak checking as soon as possible.

[0]: https://github.com/WebAudio/web-audio-api/issues/1139
[2]: http://searchfox.org/mozilla-central/source/dom/media/systemservices/MediaUtils.h#271
Assignee: nobody → padenot
Rank: 21 → 12
Priority: P2 → P1
Ehsan's patch, without the promise rejection bit, that comes in the next patch,
because it was bitrot.
This rejects the promises on shutdown, and has the part about your other patch,
but rebased on top of the rest.
Attachment #8827467 - Flags: review?(ehsan)
Attachment #8823403 - Flags: review?(padenot) → review+
This is green on try when re-triggering a number of times, and with the leak-check restored.
Comment on attachment 8827467 [details] [diff] [review]
Reject promises in flight when shutting down AudioContexts

Review of attachment 8827467 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/media/webaudio/AudioContext.cpp
@@ +638,5 @@
>    }
>  
> +  for (auto p : mPromiseGripArray) {
> +    p->MaybeReject(NS_ERROR_DOM_INVALID_STATE_ERR);
> +  }

I think you also need to clear the whole array here.  r=me with that.
Attachment #8827467 - Flags: review?(ehsan) → review+
Flags: needinfo?(ehsan)
Pushed by paul@paul.cx:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c01f7957ff52
Ensure that all AudioContext members that need to participate in cycle collection do so; r=padenot
https://hg.mozilla.org/integration/mozilla-inbound/rev/353266dd63ed
Reject promises in flight when shutting down AudioContexts.  r=ehsan
Duplicate of this bug: 1330372
Please request Aurora approval on this when you're comfortable doing so.
Flags: needinfo?(padenot)
Attachment #8823403 - Attachment is obsolete: true
Blocks: 1332283
(In reply to Ryan VanderMeulen [:RyanVM] from comment #49)
> Please request Aurora approval on this when you're comfortable doing so.

Beta, now. Quite an easy patch, though.
Flags: needinfo?(padenot)
Comment on attachment 8827465 [details] [diff] [review]
Ensure that all AudioContext members that need to participate in cycle collection do so

Approval Request Comment
[Feature/Bug causing the regression]: bug 1286041
[User impact if declined]: unnecessary test failure, inability to monitor leak count
[Is this code covered by automated tests?]: yes, this allows us to restore leak checking
[Has the fix been verified in Nightly?]: yes, RyanVM tells us it's better
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: All three revision pushed in this bug
[Is the change risky?]: No
[Why is the change risky/not risky?]: ehsan has looked at the CC stuff, and I've look at the promise stuff, we believe we got it right. The leak is now gone with this patch, and we've restored the leak check, that is still green.
[String changes made/needed]: none
Attachment #8827465 - Flags: approval-mozilla-beta?
Attachment #8827465 - Flags: approval-mozilla-aurora?
Comment on attachment 8827465 [details] [diff] [review]
Ensure that all AudioContext members that need to participate in cycle collection do so

This landed before the uplift, so only Beta needs it.
Attachment #8827465 - Flags: approval-mozilla-aurora?
Comment on attachment 8827465 [details] [diff] [review]
Ensure that all AudioContext members that need to participate in cycle collection do so

fix memory leak, beta52+ (all three patches), should be in b2
Attachment #8827465 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Setting qe-verify- per Comment 51.
Flags: qe-verify-
You need to log in before you can comment on or make changes to this bug.