Closed Bug 1727405 Opened 3 months ago Closed 3 months ago

Add an assertion to double-check that ClientManager::GetOrCreateForCurrentThread is not returning a stale ClientManager instance


(Core :: DOM: Workers, task)




93 Branch
Tracking Status
firefox93 --- fixed


(Reporter: rpl, Assigned: rpl)




(1 file)

The goal of this issue is to evaluate adding a new assertion (or diagnostic assertion) that explicitly checks that ClientManager::GetOrCreateForCurrentThread returns a ClientManager instance for which cm->IsShutdown() returns false.

Here is some more context about the issue that the proposed diagnostic assertion is meant to help with.

If a leak is keeping a worker global scope alive after we have terminated the worker, then:

  • the ClientSource instance associated with that worker will also be kept alive (by the WorkerPrivate instance)
  • as a side effect, that also prevents the ClientManager instance associated with the DOM worker thread from reaching ClientManager::Shutdown and being cleared from the thread locals

At that point, Gecko will add the worker thread to the pool of idle threads that may be reused to run a new worker, but reusing that thread doesn't currently fail visibly; instead it triggers a number of side effects (which may even be confused for the actual underlying issue).

In the particular case I have been investigating:

  • the worker script running on the worker thread with the stale ClientManager instance was being executed just fine
  • we noticed the issue only because we were asserting that the scriptURL included in the worker's ClientInfo was set, but surprisingly it was set to an empty string (starting from the second test task in the test file being executed)
  • but that was actually just a side effect of ClientSource::WorkerExecutionReady exiting early, before it got to call ClientSource::ExecutionReady (which is what would have set the WorkerPrivate's mClientInfo URL to the worker script URL, as we expect)

It would be nice if an assertion (or diagnostic assertion) could trigger an explicit failure sooner, making it easier to track down the actual underlying issue.

While investigating an unexpected test failure triggered by a mochitest that was testing the identity WebExtensions API
(D121683 from Bug 1723852), I noticed that the actual underlying issue was triggered by a leak
(in particular, the Extension API class in the initial draft of that D121683 patch was missing a RefPtr
in the NS_IMPL_CYCLE_COLLECTION_WRAPPERCACHE macro used for that class), but the issue was presenting
itself as an empty scriptURL in the ClientSource for the test case executed after the one that
triggered the leak, far from where it was actually triggered.

To make it easier to spot the issue nearer to its actual cause, I think it would be
reasonable to add a diagnostic assertion to ClientManager::GetOrCreateForCurrentThread that would
fire earlier if a leak was keeping the ClientManager instance alive in an idle DOM worker thread.

Assignee: nobody → lgreco
Summary: Add an assertion to double-check that ClientManager::GetOrCreateForCurrentThread is not returning a stale ClientManager instance when we are spawning a worker in a idle DOM worker thread being reused → Add an assertion to double-check that ClientManager::GetOrCreateForCurrentThread is not returning a stale ClientManager instance
Pushed by
Add a diagnostic assert to check that ClientManager::GetOrCreateForCurrentThread is not returning a stale ClientManager instance that is already shutdown. r=asuth

Backed out changeset 9e8267a443e5 (Bug 1727405) for causing telemetry failures in ClientManager.cpp
Backout link:
Push with failures, failure log.

Flags: needinfo?(lgreco)

eh, apparently we were already hitting this "unexpected" scenario with that marionette test.

I haven't been able to trigger this outside of an asan opt build, and unfortunately, due to what seems to be an unrelated issue with llvm-symbolizer (llvm-symbolizer: Unknown command line argument '--inlines'), the failure log linked in comment 3 does not include a readable stack trace.

I was curious about the call path that made us hit the diagnostic assert, so I did a local asan build, reproduced the issue, and got a symbolized stack trace; here is an extract of it:

    #0 0x7f032e6a45db in mozilla::dom::ClientManager::GetOrCreateForCurrentThread() /.../ClientManager.cpp:225:3
    #1 0x7f032e6a55ae in ExpectOrForgetFutureSource /.../ClientManager.cpp:241:31
    #2 0x7f032e6a55ae in mozilla::dom::ClientManager::ForgetFutureSource(mozilla::dom::ClientInfo const&) /.../ClientManager.cpp:261:10
    #3 0x7f032e6bf682 in mozilla::dom::(anonymous namespace)::ClientChannelHelperParent::SetFutureSourceInfo(mozilla::Maybe<mozilla::dom::ClientInfo>&&) /.../ClientChannelHelper.cpp:220:7
    #4 0x7f032e6bf111 in mozilla::dom::(anonymous namespace)::ClientChannelHelperParent::~ClientChannelHelperParent() /.../ClientChannelHelper.cpp:209:5
    #5 0x7f032e6bf3cd in mozilla::dom::(anonymous namespace)::ClientChannelHelperParent::~ClientChannelHelperParent() /.../ClientChannelHelper.cpp:204:32
    #6 0x7f032e6b69bf in mozilla::dom::(anonymous namespace)::ClientChannelHelper::Release() /.../ClientChannelHelper.cpp:200:1
    #7 0x7f03279c6b95 in ProxyRelease<nsISupports> /.../xpcom/threads/nsProxyRelease.h
    #8 0x7f03279c6b95 in detail::ProxyReleaseChooser<true>::ProxyReleaseISupports(char const*, nsIEventTarget*, nsISupports*, bool) /.../xpcom/threads/nsProxyRelease.cpp:15:3
    #16 0x7f03359c7b16 in mozilla::psm::TransportSecurityInfo::Release() /.../security/manager/ssl/TransportSecurityInfo.cpp:67:1
    #17 0x7f032890b864 in ~nsCOMPtr_base /.../dist/include/nsCOMPtr.h:328:7
    #18 0x7f032890b864 in mozilla::net::HttpBaseChannel::~HttpBaseChannel() /.../netwerk/protocol/http/HttpBaseChannel.cpp:245:1
    #26 0x7f032775e2c2 in mozilla::IncrementalFinalizeRunnable::ReleaseNow(bool) /.../CycleCollectedJSRuntime.cpp:1658:17
    #27 0x7f032775ebcf in mozilla::CycleCollectedJSRuntime::FinalizeDeferredThings(mozilla::CycleCollectedJSContext::DeferredFinalizeType) /.../CycleCollectedJSRuntime.cpp:1734:24
    #39 0x7f03277bda65 in nsCycleCollector::ShutdownCollect() /.../nsCycleCollector.cpp:3350:20
    #40 0x7f03277c35cc in Shutdown /.../nsCycleCollector.cpp:3644:5
    #41 0x7f03277c35cc in nsCycleCollector_shutdown(bool) /.../nsCycleCollector.cpp:3959:18
    #42 0x7f0327a530f8 in mozilla::ShutdownXPCOM(nsIServiceManager*) /.../XPCOMInit.cpp:709:3
    #43 0x7f03362742f7 in ScopedXPCOMStartup::~ScopedXPCOMStartup() /.../nsAppRunner.cpp:1683:5

This symbolized stack trace seems to confirm the feeling that I already had: we were hitting it late in the shutdown path.

Flags: needinfo?(lgreco)

I confirmed locally that if we skip the new diagnostic assertion when we are already shutting down, the telemetry marionette test no longer seems able to trigger it (as I was expecting).

Nevertheless, I'm not sure that ClientManager::GetOrCreateForCurrentThread should be called at all when we are that late in the shutdown path; my current feeling is that skipping the assertion would technically be more "hiding the issue" than "fixing it".

Making the caller check whether we are late in shutdown and avoid calling ClientManager::GetOrCreateForCurrentThread at all seems like a more correct way to avoid triggering the diagnostic crash in the scenario this marionette test recreates.

Locally I've tried the following two approaches, and both of them seem able to make this test complete successfully (I ran it 10 times in a row to hopefully also catch possible intermittency):

  • approach 1: Change ClientManager::GetOrCreateForCurrentThread to skip the new diagnostic assertion on !cm->IsShutdown() if PastShutdownPhase(ShutdownPhase::XPCOMShutdown) is true
  • approach 2: Change ClientManager::ExpectOrForgetFutureSource to return false earlier than calling ClientManager::GetOrCreateForCurrentThread if PastShutdownPhase(ShutdownPhase::XPCOMShutdown) is true

Hey Andrew, what do you think about these two possible approaches?
(see comment 4 for a symbolized stack trace for the same kind of diagnostic crash that has been reported along with the backout in comment 3)

Personally I'm currently a bit more in favor of the second one for the reasons described above.

Flags: needinfo?(bugmail)

I discussed this with :asuth over Slack, and we agreed to proceed with "approach 2" as described in comment 5.

I've just updated the patch on Phabricator, and the following is a new try push:

Flags: needinfo?(bugmail)
Pushed by
Add a diagnostic assert to check that ClientManager::GetOrCreateForCurrentThread is not returning a stale ClientManager instance that is already shutdown. r=asuth
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 93 Branch