Intercepted COM calls leak memory

RESOLVED FIXED in Firefox 59

Status

()

defect
RESOLVED FIXED
Last year
Last year

People

(Reporter: Jamie, Assigned: Jamie)

Tracking

({access})

unspecified
mozilla60
All
Windows
Points:
---
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox59 fixed, firefox60 fixed)

Details

Attachments

(1 attachment)

STR:
1. Start NVDA and Firefox.
2. Open any web page.
3. Get its content process id (e.g. from the tool tip on the tab).
4. Check the heap unclassified memory usage for that process in about:memory.
5. Focus the document.
6. Press NVDA+control+z to open the NVDA Python console.
7. Enter the following:
for i in xrange(50000): r = nav.IAccessibleObject.role()
8. Again check the heap unclassified memory usage for the content process in about:memory.
Expected: It should not change noticeably.
Actual: it increases by several megabytes.

Any COM call, even one that doesn't marshal any interfaces or strings (or just returns E_NOTIMPL), leaks memory.

I think I've tracked this down to MainThreadInvoker. As of bug 1384336, both a Gecko runnable and an APC are executed in the main thread to dispatch the call. From the relevant commit message (4786ec6700d05b64):

> When removing our Windows message loop pumping code in the content process, a11y code on the MTA thread must have some way to wake up the main thread. The main thread could be blocked either on a conditional variable waiting for a Gecko event, or it could be blocked waiting on a Windows HANDLE in IPC code (doing a sync message send). In the former case, we wake it up by posting an event to the main thread. In the latter case, we continue to use the asynchronous procedure call mechanism.

Both the Gecko runnable and the APC are queued in either case. A flag is used to ensure that the call only runs once. If the flag is set, the second run does nothing. We do an AddRef for both of them, and each one is responsible for releasing its reference.

It seems that now, the APC never executes, at least not most of the time. (It seems to execute while Firefox is starting, but not for client calls thereafter.) The result is that the references for the APCs that don't execute don't get released. So, for each of these calls, we leak a SyncRunnable.

My evidence for this is based on a simple patch which printfs messages in three places:
1. After using MainThreadInvoker to dispatch a call to the main thread in MainThreadHandoff::OnCall.
2. When destroying a MainThreadInvoker SyncRunnable (in ~SyncRunnable).
3. Just before releasing the SyncRunnable in MainThreadInvoker::MainThreadAPC.

For the majority of calls after Firefox starts, I only see the first message, not the other two. They probably do run before Firefox exits (and there are doubtless other things that would trigger them), but they certainly don't run for a very long time in most cases.

The reason for the APC is explained in bug 1273635:

> If the client's injected DLL makes a synchronous COM call into the content process, but the content process is waiting on a sync IPDL message that is enqueued in the parent, the COM call will never make it onto the main thread.

So, I guess there isn't much sync ipdl these days?

I also see that bug 1384336 (4786ec6700d05b64) removes some code which "clears out any pending APCs" using NtTestAlert. That code is no longer relevant because it's in the Windows message pumping stuff, but I wonder whether we now need something similar with respect to the Gecko event loop (nsThread::ProcessNextEvent, mozilla::ThreadEventQueue:Wait or similar).

Two other possible solutions come to mind:
1. Find some way to eliminate the need for the APCs without causing deadlocks.
2. Find some way to avoid the APC (or at least release its reference) if we know it's not going to be needed.
Aaron, any chance you might be able to point me in the right direction here? :)
Flags: needinfo?(aklotz)
Ah shit.
Flags: needinfo?(aklotz)
What if we made it so that the runnable doesn't directly Run() the event, but does so indirectly via NtTestAlert()?
Assignee: nobody → jteh
Thanks for the inspiration, Aaron. :)

With this patch, the leak with the STR I provided in comment 0 goes away. Unfortunately, we still leak when refreshing NVDA virtual buffers with the handler enabled. With the handler disabled, we don't leak. That leak is unrelated to this one, so I'll file a separate bug for it.
Comment on attachment 8953350 [details]
Bug 1440257: Ensure that mscom::MainThreadInvoker gets cleaned up very soon after it finishes executing.

https://reviewboard.mozilla.org/r/222646/#review229686

Mostly good but there is one significant issue, so I'll need to see this again.

::: ipc/mscom/MainThreadInvoker.cpp:39
(Diff revision 1)
>  public:
>    explicit SyncRunnable(already_AddRefed<nsIRunnable> aRunnable)
>      : mozilla::Runnable("MainThreadInvoker")
>      , mRunnable(aRunnable)
> -  {}
> +  {
> +    static bool gotStatics = InitStatics();

Make this const please

::: ipc/mscom/MainThreadInvoker.cpp:52
(Diff revision 1)
>      if (mHasRun) {
> +      // The APC already ran, so we have nothing to do.
> +      return NS_OK;
> +    }
> +    // Run the pending APC in the queue.
> +    sNtTestAlert();

Please add an assertion for sNtTestAlert before calling it.

::: ipc/mscom/MainThreadInvoker.cpp:149
(Diff revision 1)
>      return true;
>    }
>  
>    RefPtr<SyncRunnable> syncRunnable = new SyncRunnable(runnable.forget());
>  
> -  if (NS_FAILED(SystemGroup::Dispatch(
> +  // The main thread could be either blocked on a conditional variable waiting

Nit: s/conditional/condition/

::: ipc/mscom/MainThreadInvoker.cpp:155
(Diff revision 1)
> -                  TaskCategory::Other, do_AddRef(syncRunnable)))) {
> +  // for a Gecko event, or it could be blocked waiting on a Windows HANDLE in
> +  // IPC code (doing a sync message send). In the former case, we wake it by
> +  // posting a Gecko runnable to the main thread. In the latter case, we wake
> +  // it using an APC. However, the latter case doesn't happen very often now
> +  // and APCs aren't otherwise run by the main thread. To ensure the
> +  // SyncRunnable is cleaned up (bug 1440257), we need both to run consistently.

Style nit: This depends on the module, but in general I think we avoid mentioning bug numbers in comments since that information is readily available in the blamelog.

::: ipc/mscom/MainThreadInvoker.cpp:168
(Diff revision 1)
> -  // This ref gets released in MainThreadAPC when it runs.
> -  SyncRunnable* syncRunnableRef = syncRunnable.get();
> -  NS_ADDREF(syncRunnableRef);
> -  if (!::QueueUserAPC(&MainThreadAPC, sMainThread,
> +  // 2. Post a Gecko runnable (which always runs). If the APC hasn't run, the
> +  // Gecko runnable runs it. Otherwise, it does nothing.
> +  if (NS_FAILED(SystemGroup::Dispatch(
> +                  TaskCategory::Other, do_AddRef(syncRunnable)))) {
> -                      reinterpret_cast<UINT_PTR>(syncRunnableRef))) {
>      return false;

This could be a problem if QueueUserAPC succeeded but the Dispatch did not. We didn't AddRef, so by returning false, we're no longer waiting for the SyncRunnable to execute, even though it is still enqueued to the target thread.

I think you should still AddRef() it before QueueUserAPC. Doing so by the location of this comment could make it possible to lose the race with an alertable wait state on the target thread and then we leak that reference.
Attachment #8953350 - Flags: review?(aklotz) → review-
Comment on attachment 8953350 [details]
Bug 1440257: Ensure that mscom::MainThreadInvoker gets cleaned up very soon after it finishes executing.

https://reviewboard.mozilla.org/r/222646/#review230748

Awesome, thanks!
Attachment #8953350 - Flags: review?(aklotz) → review+
Pushed by aklotz@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8702f2b2c648
Ensure that mscom::MainThreadInvoker gets cleaned up very soon after it finishes executing. r=aklotz
https://hg.mozilla.org/mozilla-central/rev/8702f2b2c648
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Comment on attachment 8953350 [details]
Bug 1440257: Ensure that mscom::MainThreadInvoker gets cleaned up very soon after it finishes executing.

Approval Request Comment
[Feature/Bug causing the regression]: a11y on Windows
[User impact if declined]: Memory leaks
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: Yes
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: No
[Why is the change risky/not risky?]: Small, very localized patch, problem is clearly understood.
[String changes made/needed]: None
Attachment #8953350 - Flags: approval-mozilla-beta?
Attachment #8953350 - Flags: approval-mozilla-beta? → approval-mozilla-release?
Comment on attachment 8953350 [details]
Bug 1440257: Ensure that mscom::MainThreadInvoker gets cleaned up very soon after it finishes executing.

Fix for a memory leak, let's uplift to m-r.
Attachment #8953350 - Flags: approval-mozilla-release? → approval-mozilla-release+
Based on comment 11, marking as qe-verify-
Flags: qe-verify-
You need to log in before you can comment on or make changes to this bug.