Closed Bug 1729123 Opened 3 months ago Closed 3 months ago

Crash in [@ mozilla::dom::ClientManager::GetOrCreateForCurrentThread]

Categories

(Core :: DOM: Workers, defect)

defect

Tracking

()

RESOLVED FIXED
94 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox91 --- unaffected
firefox92 --- unaffected
firefox93 --- fixed
firefox94 --- fixed

People

(Reporter: aryx, Assigned: rpl)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

Currently 6 crashes from 3 installations across all OS platforms.

Crash report: https://crash-stats.mozilla.org/report/index/0f70133b-5273-4807-a691-ae9ab0210904

MOZ_CRASH Reason: MOZ_DIAGNOSTIC_ASSERT(!cm->IsShutdown())

Top 10 frames of crashing thread:

0 XUL mozilla::dom::ClientManager::GetOrCreateForCurrentThread dom/clients/manager/ClientManager.cpp:226
1 XUL mozilla::dom::ClientManager::CreateSource dom/clients/manager/ClientManager.cpp:315
2 XUL mozilla::dom::WorkerPrivate::CreateClientSource dom/workers/WorkerPrivate.cpp:3215
3 XUL mozilla::dom::WorkerPrivate::GetOrCreateGlobalScope dom/workers/WorkerPrivate.cpp:5191
4 XUL mozilla::dom:: dom/workers/WorkerPrivate.cpp:359
5 XUL mozilla::dom::WorkerRunnable::Run dom/workers/WorkerRunnable.cpp:378
6 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1142
7 XUL NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:466
8 XUL mozilla::dom::WorkerPrivate::DoRunLoop dom/workers/WorkerPrivate.cpp:3073
9 XUL mozilla::dom::workerinternals:: dom/workers/RuntimeService.cpp:2243
Flags: needinfo?(lgreco)

Hey Andrew,
based on the amount of diagnostic crashed collected it looks that on Fenix we are hitting this condition quite often, way less on Desktop.

In the sample of the crash reports that I looked to:

  • CompileScriptRunnable seems to be the runnable most frequently being processed when we hit the diagnostic crash (in all the ones I looked to, but I don't exclude there may be a different one less frequent)
  • all crash reports seems to be triggered in a child process
  • the uptime varies, in some cases is short enough (few seconds or a couple of minutes), but in some other cases the uptime was quite long (hours of uptime)

Based on this, I can't tell yet if the diagnostic crash may be hit because we are late in the process shutdown and still trying to execute a worker, or if there were leaks that are preventing the shutdown ClientManager to be deallocate from an idle worker thread we are trying to reuse.

The easiest way to confirm if we are hitting this just because we are late in the process shutdown could be to skip the diagnostic assertion if we are already shutting down (but if that confirms we are still trying to run a worker late in the process shutdown, we may then want to follow up by bailing out earlier in the flow to run the worker when we are already shutting down, and then bringing back the diagnostic crash to see if there are other conditions that can make us hit it).

wdyt?

Flags: needinfo?(lgreco) → needinfo?(bugmail)

Hi, just a general reminder that determining reliably if we are shutting down is not easy, currently. There is some work in flight in bug 1726813 to improve the situation, but this will help only in the parent process. The child processes would need some more love from bug 1697745 (which is only a raw sketch and might open a can of worms), though at least all XPCOM shutdown phases should be advanced correctly also there.

It seems like we're seeing that there's existing code with leak-y behavior that doesn't have sufficient test coverage and will require investigation.

Jens' point about the lack of shutdown context information in the content process is also quite right and makes me suspect that this is more likely to be the thread reuse case than a shutdown related case.

ClientManager already knows how to return errors when it has been shutdown, so I would propose that we do something like:

  • In the Worker primary runnable, right before we shut down PBackground for the thread we add an explicit call to a new ClientManager static method called ClientManager::ShutdownForThisThread or something like that.
    • This will clear the thread-local storage and call Shutdown on the ClientManager if it hasn't already been cleared.
    • And we want to make sure that the destructor still works whether it comes before or after that call.
  • Consider upgrading ClientThing::mShutdown from a boolean to an enum class that can could have states like Shutdown and Poisoned, with the new helper above setting the shutdown mode to Poisoned.
    • This would then enable us to have a mode of operation where we try and have the shutdown checks diagnostic assert indicating what was actually keeping the ClientManager alive rather than us just knowing that it was kept alive. This could get ugly though, as it might need to go in the Release() for ClientManager, but I think we have macros for that and it should already be virtual.

I think the additional state tracking for the means of shutdown and properly attributing the failures is a much larger piece than stopping the crashes in Fenix, so I think it would make sense to do the former first and file a follow-up for the latter as the latter also entails potentially non-trivial investigative work and refactorings.

Flags: needinfo?(bugmail)

And at a meta-level, I think it's great and good if there's a way to stop the diagnostic assert here while still providing coverage to the situation you were trying to guard against in your development. For example, maybe we can have a more aggressive or different behavior if test preferences are set.

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #4)

And at a meta-level, I think it's great and good if there's a way to stop the diagnostic assert here while still providing coverage to the situation you were trying to guard against in your development. For example, maybe we can have a more aggressive or different behavior if test preferences are set.

+1, that would sounds good to me too, I'm definitely ok on using a testing preference to decide if should guard cm->IsShutdown() with a diagnostic assert or not.

Assignee: nobody → lgreco
Status: NEW → ASSIGNED
Pushed by luca.greco@alcacoop.it:
https://hg.mozilla.org/integration/autoland/rev/4ae29776c6b9
Run diagnostic assertion on cm->IsShutdown only if dom.workers.testing.enabled is true. r=asuth
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 94 Branch

Luca, can you request uplift to beta? Thanks

Flags: needinfo?(lgreco)

(In reply to Pascal Chevrel:pascalc from comment #9)

Luca, can you request uplift to beta? Thanks

Sure thing, I did briefly talk with :asuth yesterday about requesting an uplift for this and I was going to request an uplift today.

Flags: needinfo?(lgreco)

Comment on attachment 9239693 [details]
Bug 1729123 - Run diagnostic assertion on cm->IsShutdown only if dom.workers.testing.enabled is true. r?asuth!

Beta/Release Uplift Approval Request

  • User impact if declined: Users on non release channels builds where the diagnostic assertion added by Bug 1727405 is actually being executed (e.g. early beta) may experience the resulting diagnostic crashes of Firefox content child processes.

Based on the crash stats, the diagnostic assertion is hit pretty frequently on Fenix (and apparently still hit but way less frequently on Desktop).

  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: The code is covered by automated tests (but no automated tests is hitting the conditions that make the diagnostic assertion to be hit).
    Manually verification of the change isn't possible because we don't have a known STR for the scenario that is hitting the diagnostic assertion, but the frequency of the crash reported for the diagnostic assertion are expected to go back to 0 as they did on Nightly (which will technically be a manual verification).
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The patch is pretty small (just a few lines) and restricted to ClientManager::GetOrCreateForCurrentThread:

it does only check skip if the pref dom.workers.testing.enabled is not explicitly set to true and skip the diagnostic assertion if that is the case.

The fix has been landed on mozilla-central and the crash reports related to Nightly 94 have already confirmed the expected outcome (that the diagnostic assertion isn't executed for the user Firefox profiles, where dom.workers.testing.enabled is set to false by default).

  • String changes made/needed:
Attachment #9239693 - Flags: approval-mozilla-beta?

Comment on attachment 9239693 [details]
Bug 1729123 - Run diagnostic assertion on cm->IsShutdown only if dom.workers.testing.enabled is true. r?asuth!

Uplift approved for 93 beta 3, thanks.

Attachment #9239693 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.