Bug 1527652 Comment 2 Edit History

Original comment by

on 2019-02-19 11:39:19 PST

## Risk Analysis
There is no chance for user events/scripts to run in the QuotaManager shutdown gap. The code is executing on PBackground where no content runs and content is only able to interact via IPC messages, and no IPC messages are allowed to be processed until after all of this code finishes running and returns control flow to the message loop.

We expect this problem to occur in the following circumstances:
- Content has some DOM Cache API requests in flight.
- One of the following is happening, neither of which are things nefarious content can directly control.
- This one, shutdown under a DEBUG or ASAN build (or similar non-production build).
- The user triggers privacy data clearing/sanitization for the origin without first terminating the globals in question. (I think impacted tabs will be closed, but things are a little tricky with ServiceWorkers, and with races and such, this could happen.)

## Defect Analysis

It's also worth noting that ASAN builds differ from (non-debug) production builds. In production, https://searchfox.org/mozilla-central/rev/93905b660fc99a5d52b683690dd26471daca08c8/dom/ipc/ContentChild.cpp#2142 (compiled in due to https://searchfox.org/mozilla-central/rev/93905b660fc99a5d52b683690dd26471daca08c8/xpcom/base/nscore.h#172) will have killed the content process during "profile-before-change". This will take down the PBackground channel and cause cleanup due to ActorDestroy invocations, avoiding what we're seeing above which happens during "profile-before-change-qm" which is strictly after "profile-before-change".

But in ASAN/DEBUG shutdown, PBackground continues to exist until it gets cleanly torn down from either end. The DOM Cache implementation makes things difficult for itself here by not just letting the WorkerPrivate completely shutdown and let the PBackground channel be torn down. It tries to keep going so pending requests can be fulfilled and/or actors are explicitly torn down (out of concern for the process being aborted due to a message arriving for a dead actor, something that I believe may no longer be fatal). Which is how we can find ourselves with QuotaManager being the one trying to tear down client-owned PBackground actors. (Note that there are also some races likely involved.)

The other good news is that ASAN is awesome and it helps illuminate the problem here that:
- CacheStreamControlParent::Shutdown() is invoking Send__delete__. The class has an IPC-managed lifecycle which means that the instance (which also subclasses StreamControl).
- This happens because of the cascade of StreamControl::CloseAllReadStreams -> (potentially multiple times) ReadStream::Inner::CloseStream -> ReadStream::Inner::Close -> ReadStream::Inner::NoteClosed -> (because already on current thread) ReadStream::Inner::NoteClosedOnOwningThread -> StreamControl::NoteClosed -> (virtual dispatched, differs between parent and child) CacheStreamControlParent::NoteClosedAfterForget -> CacheStreamControlParent::RecvNoteClosed -> StreamList::NoteClosed -> (because mList.IsEmpty() && mStreamControl) CacheStreamControlParent::Shutdown -> CacheStreamControlParent::Send__delete__ -> DEALLOCATED.
- When we return to CloseAllReadStreams, the instance is deallocated and if we touch anything
of the instance, it's a use-after-free.
- Note that StreamControl::CloseAllReadStreams is invoked in the following chain that's common to both our free and our use-after-free: Context::CancelAll -> (iterating over mActivityList) StreamList::Cancel -> StreamList::CloseAll -> CacheStreamControlParent::CloseAll -> CacheStreamControlParent::NotifyCloseAll -> StreamControl::CloseAllReadStreams.

In production, the case we expect to be invoked because of PBackground's abrupt termination is CacheStreamControlParent::ActorDestroy at https://searchfox.org/mozilla-central/rev/93905b660fc99a5d52b683690dd26471daca08c8/dom/cache/CacheStreamControlParent.cpp#96 which:
- Invokes StreamControl::CloseAllReadStreamsWithoutReporting() -> (iterating, multiple invocations) ReadStream::Inner::CloseStreamWithoutReporting -> ReadStream::Inner::Forget -> (because already on the right thread) StreamControl::ForgetReadStream which just removes the stream from mReadStreamList.
- The above control flow notably differs in that we're just calling ForgetReadStream on StreamControl instead of calling NoteClosed which calls ForgetReadStream but then goes on to invoke NoteClosedAfterForget which is where things get all self-deallocatey.
- invokes (cache) Manager::RemoveListener. Don't care.
- invokes StreamList::RemoveStreamControl which clears its mStreamControl which precludes SCacheStreamControlParent::Shutdown being invoked.
- invokes StreamList::NoteClosedAll. This is post facto cleanup via (cache) Manager::ReleaseBodyId followed by a conditional call to CacheStreamControlParent::Shutdown. Because StreamList::RemoveStreamControl was called immediately before NoteClosedAll, that conditional call to Shutdown is never made.

There's a number of questions that are begged here. Things are complicated because our DEBUG shutdown approach is very complicated and arguably a bit non-sensical. I think the right answer for what to do depends on what we do for privacy reasons, we abort operations for an origin. The stack for that looks like cache::QuotaClient::AbortOperations() -> cache::Manager::Abort -> cache::Manager::Factory::Abort -> cache::Context::CancelAll. Since that's also what's causing the problem here, this is something we need to fix as-is without just changing how shutdown works. (Although, it does also beg the question of whether AbortOperations should already have ensured all the relevant globals have terminated. But it's also the case that this is something we need to address from both tearing down the global and ensuring that we tear down all the activities they were up to.) We do want the actors torn down.

I think that makes the right place to clean this up StreamList::Cancel/CloseAll. Only Cancel calls CloseAll. (Cancel does have other callers that branch out a bit.) Because the expected outcome is a series of calls to StreamList::NoteClosed that will culminate in mStreamControl->Shutdown() being invoked, we can move that Shutdown call up the stack to a safe point by:
- In StreamList::CloseAll() is non-null, steal mStreamControl, leaving it null during the call to CloseAll().
- Then invoke streamControl->Shutdown before returning. This avoids having CacheStreamControlParent on the stack.

So, since I think I figured this out, I'm going to take this bug and put up a fix.

Revision 1 by

Andrew Sutherland [:asuth] (he/him)

on 2019-02-19 11:40:36 PST

## Risk Analysis
There is no chance for user events/scripts to run in the QuotaManager shutdown gap. The code is executing on PBackground where no content runs and content is only able to interact via IPC messages, and no IPC messages are allowed to be processed until after all of this code finishes running and returns control flow to the message loop.

We expect this problem to occur in the following circumstances:
- Content has some DOM Cache API requests in flight.
- One of the following is happening, neither of which are things nefarious content can directly control.
- This one, shutdown under a DEBUG or ASAN build (or similar non-production build).
- The user triggers privacy data clearing/sanitization for the origin without first terminating the globals in question. (I think impacted tabs will be closed, but things are a little tricky with ServiceWorkers, and with races and such, this could happen.)

## Defect Analysis

It's also worth noting that ASAN builds differ from (non-debug) production builds. In production, https://searchfox.org/mozilla-central/rev/93905b660fc99a5d52b683690dd26471daca08c8/dom/ipc/ContentChild.cpp#2142 (compiled in due to https://searchfox.org/mozilla-central/rev/93905b660fc99a5d52b683690dd26471daca08c8/xpcom/base/nscore.h#172) will have killed the content process during "profile-before-change". This will take down the PBackground channel and cause cleanup due to ActorDestroy invocations, avoiding what we're seeing above which happens during "profile-before-change-qm" which is strictly after "profile-before-change".

But in ASAN/DEBUG shutdown, PBackground continues to exist until it gets cleanly torn down from either end. The DOM Cache implementation makes things difficult for itself here by not just letting the WorkerPrivate completely shutdown and let the PBackground channel be torn down. It tries to keep going so pending requests can be fulfilled and/or actors are explicitly torn down (out of concern for the process being aborted due to a message arriving for a dead actor, something that I believe may no longer be fatal). Which is how we can find ourselves with QuotaManager being the one trying to tear down client-owned PBackground actors. (Note that there are also some races likely involved.)

The other good news is that ASAN is awesome and it helps illuminate the problem here that:
- CacheStreamControlParent::Shutdown() is invoking Send__delete__. The class has an IPC-managed lifecycle which means that the instance (which also subclasses StreamControl).
- This happens because of the cascade of StreamControl::CloseAllReadStreams -> (potentially multiple times) ReadStream::Inner::CloseStream -> ReadStream::Inner::Close -> ReadStream::Inner::NoteClosed -> (because already on current thread) ReadStream::Inner::NoteClosedOnOwningThread -> StreamControl::NoteClosed -> (virtual dispatched, differs between parent and child) CacheStreamControlParent::NoteClosedAfterForget -> CacheStreamControlParent::RecvNoteClosed -> StreamList::NoteClosed -> (because mList.IsEmpty() && mStreamControl) CacheStreamControlParent::Shutdown -> CacheStreamControlParent::Send__delete__ -> DEALLOCATED.
- When we return to CloseAllReadStreams, the instance is deallocated and if we touch anything
of the instance, it's a use-after-free.
- Note that StreamControl::CloseAllReadStreams is invoked in the following chain that's common to both our free and our use-after-free: Context::CancelAll -> (iterating over mActivityList) StreamList::Cancel -> StreamList::CloseAll -> CacheStreamControlParent::CloseAll -> CacheStreamControlParent::NotifyCloseAll -> StreamControl::CloseAllReadStreams.

In production, the case we expect to be invoked because of PBackground's abrupt termination is CacheStreamControlParent::ActorDestroy at https://searchfox.org/mozilla-central/rev/93905b660fc99a5d52b683690dd26471daca08c8/dom/cache/CacheStreamControlParent.cpp#96 which:
- Invokes StreamControl::CloseAllReadStreamsWithoutReporting() -> (iterating, multiple invocations) ReadStream::Inner::CloseStreamWithoutReporting -> ReadStream::Inner::Forget -> (because already on the right thread) StreamControl::ForgetReadStream which just removes the stream from mReadStreamList.
- The above control flow notably differs in that we're just calling ForgetReadStream on StreamControl instead of calling NoteClosed which calls ForgetReadStream but then goes on to invoke NoteClosedAfterForget which is where things get all self-deallocatey.
- invokes (cache) Manager::RemoveListener. Don't care.
- invokes StreamList::RemoveStreamControl which clears its mStreamControl which precludes SCacheStreamControlParent::Shutdown being invoked.
- invokes StreamList::NoteClosedAll. This is post facto cleanup via (cache) Manager::ReleaseBodyId followed by a conditional call to CacheStreamControlParent::Shutdown. Because StreamList::RemoveStreamControl was called immediately before NoteClosedAll, that conditional call to Shutdown is never made.

There's a number of questions that are begged here. Things are complicated because our DEBUG shutdown approach is very complicated and arguably a bit non-sensical. I think the right answer for what to do depends on what we do for privacy reasons where we abort operations for an origin. The stack for that looks like cache::QuotaClient::AbortOperations() -> cache::Manager::Abort -> cache::Manager::Factory::Abort -> cache::Context::CancelAll. Since that's also what's causing the problem here, this is something we need to fix as-is without just changing how shutdown works. (Although, it does also beg the question of whether AbortOperations should already have ensured all the relevant globals have terminated. But it's also the case that this is something we need to address from both tearing down the global and ensuring that we tear down all the activities they were up to.) We do want the actors torn down.

I think that makes the right place to clean this up StreamList::Cancel/CloseAll. Only Cancel calls CloseAll. (Cancel does have other callers that branch out a bit.) Because the expected outcome is a series of calls to StreamList::NoteClosed that will culminate in mStreamControl->Shutdown() being invoked, we can move that Shutdown call up the stack to a safe point by:
- In StreamList::CloseAll() is non-null, steal mStreamControl, leaving it null during the call to CloseAll().
- Then invoke streamControl->Shutdown before returning. This avoids having CacheStreamControlParent on the stack.

So, since I think I figured this out, I'm going to take this bug and put up a fix.

Revision 2 by

Andrew Sutherland [:asuth] (he/him)

on 2019-02-19 11:42:01 PST

## Risk Analysis
There is no chance for user events/scripts to run in the QuotaManager shutdown gap. The code is executing on PBackground where no content runs and content is only able to interact via IPC messages, and no IPC messages are allowed to be processed until after all of this code finishes running and returns control flow to the message loop.

We expect this problem to occur in the following circumstances:
- Content has some DOM Cache API requests in flight.
- One of the following is happening, neither of which are things nefarious content can directly control.
- This one, shutdown under a DEBUG or ASAN build (or similar non-production build).
- The user triggers privacy data clearing/sanitization for the origin without first terminating the globals in question. (I think impacted tabs will be closed, but things are a little tricky with ServiceWorkers, and with races and such, this could happen.)

## Defect Analysis

It's also worth noting that ASAN builds differ from (non-debug) production builds. In production, https://searchfox.org/mozilla-central/rev/93905b660fc99a5d52b683690dd26471daca08c8/dom/ipc/ContentChild.cpp#2142 (compiled in due to https://searchfox.org/mozilla-central/rev/93905b660fc99a5d52b683690dd26471daca08c8/xpcom/base/nscore.h#172) will have killed the content process during "profile-before-change". This will take down the PBackground channel and cause cleanup due to ActorDestroy invocations, avoiding what we're seeing above which happens during "profile-before-change-qm" which is strictly after "profile-before-change".

But in ASAN/DEBUG shutdown, PBackground continues to exist until it gets cleanly torn down from either end. The DOM Cache implementation makes things difficult for itself here by not just letting the WorkerPrivate completely shutdown and let the PBackground channel be torn down. It tries to keep going so pending requests can be fulfilled and/or actors are explicitly torn down (out of concern for the process being aborted due to a message arriving for a dead actor, something that I believe may no longer be fatal). Which is how we can find ourselves with QuotaManager being the one trying to tear down client-owned PBackground actors. (Note that there are also some races likely involved.)

The other good news is that ASAN is awesome and it helps illuminate the problem here that:
- CacheStreamControlParent::Shutdown() is invoking Send__delete__. The class has an IPC-managed lifecycle which means that the instance (which also subclasses StreamControl).
- This happens because of the cascade of StreamControl::CloseAllReadStreams -> (potentially multiple times) ReadStream::Inner::CloseStream -> ReadStream::Inner::Close -> ReadStream::Inner::NoteClosed -> (because already on current thread) ReadStream::Inner::NoteClosedOnOwningThread -> StreamControl::NoteClosed -> (virtual dispatched, differs between parent and child) CacheStreamControlParent::NoteClosedAfterForget -> CacheStreamControlParent::RecvNoteClosed -> StreamList::NoteClosed -> (because mList.IsEmpty() && mStreamControl) CacheStreamControlParent::Shutdown -> CacheStreamControlParent::Send__delete__ -> DEALLOCATED.
- When we return to CloseAllReadStreams, the instance is deallocated and if we touch anything
of the instance, it's a use-after-free.
- Note that StreamControl::CloseAllReadStreams is invoked in the following chain that's common to both our free and our use-after-free: Context::CancelAll -> (iterating over mActivityList) StreamList::Cancel -> StreamList::CloseAll -> CacheStreamControlParent::CloseAll -> CacheStreamControlParent::NotifyCloseAll -> StreamControl::CloseAllReadStreams.

In production, the case we expect to be invoked because of PBackground's abrupt termination is CacheStreamControlParent::ActorDestroy at https://searchfox.org/mozilla-central/rev/93905b660fc99a5d52b683690dd26471daca08c8/dom/cache/CacheStreamControlParent.cpp#96 which:
- Invokes StreamControl::CloseAllReadStreamsWithoutReporting() -> (iterating, multiple invocations) ReadStream::Inner::CloseStreamWithoutReporting -> ReadStream::Inner::Forget -> (because already on the right thread) StreamControl::ForgetReadStream which just removes the stream from mReadStreamList.
- The above control flow notably differs in that we're just calling ForgetReadStream on StreamControl instead of calling NoteClosed which calls ForgetReadStream but then goes on to invoke NoteClosedAfterForget which is where things get all self-deallocatey.
- invokes (cache) Manager::RemoveListener. Don't care.
- invokes StreamList::RemoveStreamControl which clears its mStreamControl which precludes SCacheStreamControlParent::Shutdown being invoked.
- invokes StreamList::NoteClosedAll. This is post facto cleanup via (cache) Manager::ReleaseBodyId followed by a conditional call to CacheStreamControlParent::Shutdown. Because StreamList::RemoveStreamControl was called immediately before NoteClosedAll, that conditional call to Shutdown is never made.

There's a number of questions that are begged here. Things are complicated because our DEBUG shutdown approach is very complicated and arguably a bit non-sensical. I think the right answer for what to do depends on what we do for privacy reasons where we abort operations for an origin. The stack for that looks like cache::QuotaClient::AbortOperations() -> cache::Manager::Abort -> cache::Manager::Factory::Abort -> cache::Context::CancelAll. Since that's also what's causing the problem here, this is something we need to fix as-is without just changing how shutdown works. (Although, it does also beg the question of whether AbortOperations should already have ensured all the relevant globals have terminated. But it's also the case that this is something we need to address from both tearing down the global and ensuring that we tear down all the activities they were up to.) We do want the actors torn down.

I think that makes the right place to clean this up StreamList::Cancel/CloseAll. Only Cancel calls CloseAll. (Cancel does have other callers that branch out a bit.) Because the expected outcome is a series of calls to StreamList::NoteClosed that will culminate in mStreamControl->Shutdown() being invoked, we can move that Shutdown call up the stack to a safe point by:
- In StreamList::CloseAll() where mStreamControl is non-null, steal mStreamControl, leaving it null during the call to CloseAll(), avoiding the call to Shutdown() in NoteClosed.
- Then invoke streamControl->Shutdown before returning. This avoids having CacheStreamControlParent on the stack but maintains the invariant that we've shutdown and deleted the stream control.

So, since I think I figured this out, I'm going to take this bug and put up a fix.