Hit MOZ_CRASH(Quota manager shutdown timed out) at /builds/worker/checkouts/gecko/dom/quota/ActorsParent.cpp:2273
Categories: Core :: Storage: Quota Manager, defect, P2
Tracking:
 | Tracking | Status
---|---|---
firefox123 | --- | affected
People: (Reporter: tsmith, Unassigned)
References: Blocks 2 open bugs
Details: 4 keywords, Whiteboard: [bugmon:bisected,confirmed]
Crash Data
Attachments: 1 file, 1 obsolete file (1.15 KB, text/html)
Found while fuzzing m-c 20231105-c6548a743f8f (--enable-debug --enable-fuzzing)
To reproduce via Grizzly Replay:
$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch -d --fuzzing -n firefox
$ python -m grizzly.replay.bugzilla ./firefox/firefox <bugid> --no-harness --timeout 90
Hit MOZ_CRASH(Quota manager shutdown timed out) at /builds/worker/checkouts/gecko/dom/quota/ActorsParent.cpp:2273
#0 0x7fe63f241a70 in operator() /builds/worker/checkouts/gecko/dom/quota/ActorsParent.cpp:2273:5
#1 0x7fe63f241a70 in mozilla::dom::quota::QuotaManager::Shutdown()::$_11::__invoke(nsITimer*, void*) /builds/worker/checkouts/gecko/dom/quota/ActorsParent.cpp:2238:36
#2 0x7fe63aa5631d in operator() /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:681:36
#3 0x7fe63aa5631d in matchN<mozilla::Variant<nsTimerImpl::UnknownCallback, nsCOMPtr<nsITimerCallback>, nsCOMPtr<nsIObserver>, nsTimerImpl::FuncCallback, nsTimerImpl::ClosureCallback> &, (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:681:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:682:7)> /builds/worker/workspace/obj-build/dist/include/mozilla/Variant.h:309:16
#4 0x7fe63aa5631d in matchN<mozilla::Variant<nsTimerImpl::UnknownCallback, nsCOMPtr<nsITimerCallback>, nsCOMPtr<nsIObserver>, nsTimerImpl::FuncCallback, nsTimerImpl::ClosureCallback> &, (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:678:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:681:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:682:7)> /builds/worker/workspace/obj-build/dist/include/mozilla/Variant.h:318:14
#5 0x7fe63aa5631d in matchN<mozilla::Variant<nsTimerImpl::UnknownCallback, nsCOMPtr<nsITimerCallback>, nsCOMPtr<nsIObserver>, nsTimerImpl::FuncCallback, nsTimerImpl::ClosureCallback> &, (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:677:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:678:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:681:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:682:7)> /builds/worker/workspace/obj-build/dist/include/mozilla/Variant.h:318:14
#6 0x7fe63aa5631d in matchN<mozilla::Variant<nsTimerImpl::UnknownCallback, nsCOMPtr<nsITimerCallback>, nsCOMPtr<nsIObserver>, nsTimerImpl::FuncCallback, nsTimerImpl::ClosureCallback> &, (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:676:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:677:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:678:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:681:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:682:7)> /builds/worker/workspace/obj-build/dist/include/mozilla/Variant.h:318:14
#7 0x7fe63aa5631d in matchN<mozilla::Variant<nsTimerImpl::UnknownCallback, nsCOMPtr<nsITimerCallback>, nsCOMPtr<nsIObserver>, nsTimerImpl::FuncCallback, nsTimerImpl::ClosureCallback> &, (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:676:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:677:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:678:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:681:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:682:7)> /builds/worker/workspace/obj-build/dist/include/mozilla/Variant.h:902:12
#8 0x7fe63aa5631d in match<(lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:676:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:677:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:678:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:681:7), (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:682:7)> /builds/worker/workspace/obj-build/dist/include/mozilla/Variant.h:857:12
#9 0x7fe63aa5631d in nsTimerImpl::Fire(int) /builds/worker/checkouts/gecko/xpcom/threads/nsTimerImpl.cpp:675:22
#10 0x7fe63aa5548b in nsTimerEvent::Run() /builds/worker/checkouts/gecko/xpcom/threads/TimerThread.cpp:515:11
#11 0x7fe63aa637b1 in nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1193:16
#12 0x7fe63aa6a78d in NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:480:10
#13 0x7fe63f202501 in SpinEventLoopUntil<(mozilla::ProcessFailureBehavior)1, (lambda at /builds/worker/checkouts/gecko/dom/quota/ActorsParent.cpp:2410:5)> /builds/worker/workspace/obj-build/dist/include/mozilla/SpinEventLoopUntil.h:176:25
#14 0x7fe63f202501 in mozilla::dom::quota::QuotaManager::Shutdown() /builds/worker/checkouts/gecko/dom/quota/ActorsParent.cpp:2410:5
#15 0x7fe63f1fd58b in mozilla::dom::quota::QuotaManager::ShutdownInstance() /builds/worker/checkouts/gecko/dom/quota/ActorsParent.cpp:1755:16
#16 0x7fe63f1fd470 in mozilla::dom::quota::RecvShutdownQuotaManager() /builds/worker/checkouts/gecko/dom/quota/ActorsParent.cpp:1405:3
#17 0x7fe63b6c88b6 in mozilla::ipc::BackgroundParentImpl::RecvShutdownQuotaManager() /builds/worker/checkouts/gecko/ipc/glue/BackgroundParentImpl.cpp:982:8
#18 0x7fe63b7c0f68 in mozilla::ipc::PBackgroundParent::OnMessageReceived(IPC::Message const&) /builds/worker/workspace/obj-build/ipc/ipdl/PBackgroundParent.cpp:4327:52
#19 0x7fe63b737b5f in mozilla::ipc::MessageChannel::DispatchAsyncMessage(mozilla::ipc::ActorLifecycleProxy*, IPC::Message const&) /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:1813:25
#20 0x7fe63b7348b2 in mozilla::ipc::MessageChannel::DispatchMessage(mozilla::ipc::ActorLifecycleProxy*, mozilla::UniquePtr<IPC::Message, mozilla::DefaultDelete<IPC::Message>>) /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:1732:9
#21 0x7fe63b735532 in mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::ActorLifecycleProxy*, mozilla::ipc::MessageChannel::MessageTask&) /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:1525:3
#22 0x7fe63b73667f in mozilla::ipc::MessageChannel::MessageTask::Run() /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp:1623:14
#23 0x7fe63aa637b1 in nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1193:16
#24 0x7fe63aa6a78d in NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:480:10
#25 0x7fe63b73ed15 in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:330:5
#26 0x7fe63b657591 in RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:363:3
#27 0x7fe63b657591 in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:345:3
#28 0x7fe63aa5ea83 in nsThread::ThreadFunc(void*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:370:10
#29 0x7fe6502ddd0f in _pt_root /builds/worker/checkouts/gecko/nsprpub/pr/src/pthreads/ptthread.c:201:5
#30 0x7fe650094ac2 in start_thread nptl/pthread_create.c:442:8
#31 0x7fe65012665f misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Comment 1•9 months ago
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on release
- Top 20 desktop browser crashes on beta
- Top 10 desktop browser crashes on nightly
- Top 5 desktop browser crashes on Windows on beta
- Top 5 desktop browser crashes on Windows on release
For more information, please visit BugBot documentation.
Comment 2•9 months ago
Verified bug as reproducible on mozilla-central 20240104213501-45533d2448ef.
Unable to bisect testcase (Testcase reproduces on start build!):
Start: db88fa190f63506c1da204a5ff73202d679611e9 (20230106041444)
End: c6548a743f8f4efdbbb05ff98e749612bf9d1329 (20231105091909)
BuildFlags: BuildFlags(asan=False, tsan=False, debug=True, fuzzing=True, coverage=False, valgrind=False, no_opt=False, fuzzilli=False, nyx=False)
Reporter
Comment 3•9 months ago
I think bugmon is struggling with that. In the meantime https://pernos.co/debug/j9E3H34nKjgi3jJVIELJxg/index.html.
Comment 4•9 months ago
(In reply to Tyson Smith [:tsmith] from comment #3)
> I think bugmon is struggling with that. In the meantime https://pernos.co/debug/j9E3H34nKjgi3jJVIELJxg/index.html.
Thanks! I added tons of comments...
IIUC we have an ordering problem. Speaking at a very high level:

- QuotaManager asks all quota clients to release their directory locks when starting shutdown, with the `ShutdownStorageOp` trying to acquire an exclusive lock. It blocks until it gets that exclusive lock.
- In particular, FileSystem can open and keep a lock for quite a while, and it does so here from a worker on a content process (see the sketch after this list). QuotaManager needs the collaboration of that content process in terms of reacting properly to `Send/RecvCloseAll`. I assume this collaboration is really necessary only when the content process is actually doing file operations; in this example we only ever requested a root handle.
- However, when the main process shutdown starts (and before the quota manager shutdown phase even begins) we do `SignalImpendingShutdownToContentJS`, which seems to interfere here. In fact the content process seems to have decided to shut down, and when the timeout assertion fires the content process shutdown has arrived well into `ShutdownPhase::XPCOMShutdownThreads`, stuck on `RuntimeService::Cleanup` and probably waiting for that same worker that initiated the FileSystem usage.
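To make the FileSystem part of this concrete, here is a minimal, hypothetical worker sketch (referenced from the second bullet above); the file name and the write are made up, and the actual testcase only ever requests the root directory handle:

```js
// worker.js (module worker) - illustrative sketch only.
async function holdHandleOpen() {
  const root = await navigator.storage.getDirectory();
  const file = await root.getFileHandle("scratch.bin", { create: true });
  const access = await file.createSyncAccessHandle();
  access.write(new Uint8Array(16), { at: 0 });
  access.flush();
  // Intentionally no access.close(): as long as this worker neither closes the
  // handle nor reacts to CloseAll, the parent-side directory lock stays held
  // and QuotaManager::Shutdown() keeps spinning until its crash timer fires.
}
holdHandleOpen();
```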
I think this is a systemic race, but I assume most of the time the FileSystem `RecvCloseAll` loses it and will not be executed anymore, so I'd expect this to happen pretty much always when we enter shutdown with a content process holding an active directory lock.

We could mitigate this somehow by implementing a file-system-specific handler for the `killActorsTimerCallback` we execute during QM shutdown, but that would probably just paint over the underlying problem and leave us with a systemic shutdown delay.

I assume we should find a way to become more independent from what's happening in the content process here, or we need to take back or defer `SignalImpendingShutdownToContentJS`, or both.
My understanding is that write operations will go to a copy of the file, and interrupting them will not make us lose the original file. So my gut feeling would be that "why bother waiting for the content process at all, just freeze the state on the parent side and ignore further requests" could be a strategy here, but I surely might overlook something.
Tagging :smaug for the content process bits and :jari for the file system pieces for thoughts.
Comment 5•9 months ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #4)
> My understanding is that write operations will go to a copy of the file, and interrupting them will not make us lose the original file. So my gut feeling would be that "why bother waiting for the content process at all, just freeze the state on the parent side and ignore further requests" could be a strategy here, but I surely might overlook something.
But this is only one case, when a `WritableFileStream` writes to a temporary file. There's also `SyncAccessHandle`, which writes directly to the main file.

And the mechanism described above is also used when an entire origin is supposed to be cleared. We need to close all open files before the clearing can be performed. OPFS writes to files directly in content processes, so content processes need to close them.
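For illustration, a minimal sketch of the two write paths (hypothetical file names; `createSyncAccessHandle()` is only available in dedicated workers):

```js
// FileSystemWritableFileStream: data goes to a temporary file and is only
// committed to the target file when close() succeeds.
async function writeViaStream(root) {
  const file = await root.getFileHandle("a.txt", { create: true });
  const stream = await file.createWritable();
  await stream.write("hello");
  await stream.close(); // commit point; aborting earlier keeps the old contents
}

// FileSystemSyncAccessHandle (workers only): writes land directly in the
// main file, so interrupting the writer can leave partial data behind.
async function writeViaSyncAccessHandle(root) {
  const file = await root.getFileHandle("b.bin", { create: true });
  const access = await file.createSyncAccessHandle();
  access.write(new TextEncoder().encode("hello"), { at: 0 });
  access.flush();
  access.close();
}
```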
My understanding is that normally content processes should be able to respond to IPC during shutdown; there are only some edge cases where the main thread or a worker in the content process is extremely busy or even stuck, so new IPC messages can't be processed. In that case we should fall back to the last resort you already mentioned: force-killing the actors (after 5 seconds).
However, I don't know if the interference with `SignalImpendingShutdownToContentJS` should be considered an edge case where we would rely on the force kill of actors. If it can happen quite often, I would say we should try to fix it to not interfere?
Comment 6•9 months ago
(In reply to Jan Varga [:janv] from comment #5)
> But this is only one case, when a `WritableFileStream` writes to a temporary file. There's also `SyncAccessHandle`, which writes directly to the main file.
> And the mechanism described above is also used when an entire origin is supposed to be cleared. We need to close all open files before the clearing can be performed. OPFS writes to files directly in content processes, so content processes need to close them.
OK, I surely overlooked something... :-)
> My understanding is that normally content processes should be able to respond to IPC during shutdown; there are only some edge cases where the main thread or a worker in the content process is extremely busy or even stuck, so new IPC messages can't be processed. In that case we should fall back to the last resort you already mentioned: force-killing the actors (after 5 seconds).
Note that the content process is able to respond to IPC until it shuts down, and to do whatever it needs to do during its own shutdown. Only the execution of content JS should be interrupted by `SignalImpendingShutdownToContentJS`, but depending on the nature of the process this might lead to a shutdown? I actually wonder why the process is starting its shutdown already. What's worse, the content process tries to shut down but is stuck on the worker. I assume/hope that if the shutdown had been successful, the actor would have been destroyed in the parent and everything would have been fine.

So I think something in the handling of FS actors and/or workers in conjunction on the content side did not do exactly what it should on shutdown, and whatever it is might bite us in other situations where content processes are shutting down for whatever reasons.
> However, I don't know if the interference with `SignalImpendingShutdownToContentJS` should be considered an edge case where we would rely on the force kill of actors. If it can happen quite often, I would say we should try to fix it to not interfere?
No, I think it is not an edge case. We always call it before QM enters shutdown, on all content processes. It might be edgier in that it requires FS to be used on a worker, but I'd assume that at least in our CI this happens pretty often. And I'd definitely want to understand better what's happening there.
`SignalImpendingShutdownToContentJS` is doing two things that only in conjunction will interrupt content JS. One is gated by the pref `dom.abort_script_on_child_shutdown`.
A first course of action might be to see if the test case reproduces locally and then flip that pref.
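For a local experiment that could be a single line in the profile's user.js (assuming the pref is a boolean and `false` disables the script-abort part):

```js
// Hypothetical local tweak for reproduction experiments only.
user_pref("dom.abort_script_on_child_shutdown", false);
```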
Edit: Looking at the test case, I also see a `setTimeout("self.close()", 2000)` in the body of the main script. That might mean that the test closes the only window, and that closes the entire browser, not any of our magic?
Comment 7•9 months ago
Clearing :smaug's ni? for now; further investigation on why we are shutting down is pending. I still think someone fluent in FS should try to look into the "why the content process is hanging" question, though.
Comment 8•9 months ago
Jens, thanks for more detailed information.
If a content process starts shutting down before QM shutdown in the parent process, it should close all its SyncAccessHandles and WritableFileStreams, which means that close IPC messages are sent to the parent, so QM shutdown shouldn't then be blocked by that.
But as you said, ideally the testcase would be converted to a crashtest, so someone could take a closer look at what really happens and whether there's a bug in OPFS shutdown or it's something else.
One thing that should definitely be checked is how and when the callbacks for WorkerRefs are called in this case. These callbacks send close IPC messages to the parent.
Comment 9•9 months ago
Running the test locally reproduces the crash, even if I comment out the entire body of `SignalImpendingShutdownToContentJS`. So I'd say it is not related, and we can concentrate on "why the content process is hanging".
Comment 10•9 months ago
(In reply to Jens Stutte [:jstutte] from comment #9)
> Running the test locally reproduces the crash, even if I comment out the entire body of `SignalImpendingShutdownToContentJS`. So I'd say it is not related, and we can concentrate on "why the content process is hanging".
Interestingly, the test does not crash when run as a crashtest in CI: https://treeherder.mozilla.org/jobs?repo=try&revision=26681d73fd73116fb176d38f1183221e1aef55ae
And I am not able to run it locally as a crashtest at all, as the doodle from bug 1831144 is shown. I am able to run it locally just via `mach run dom/fs/test/crashtests/1872960.html`, and then it reproduces the QM shutdown hang.
Comment 11•9 months ago
Comment 12•9 months ago
OK, as far as the test is concerned, we probably need a marionette test here in order to make sure we are the only tab running in the parent process, such that `self.close` will shut down the parent.
In the meantime I took a second look at the pernosco session.
> So I think something in the handling of FS actors and/or workers in conjunction on the content side did not do exactly what it should on shutdown
is true; at least one thing that goes wrong in this test case is that the worker in the content process is spammed by an endless loop of `StringBlobImpl::CreateInputStream`, such that any event posted to it will be added at the end of a very long event queue. For example, our `DumpCrashInfoRunnable::DispatchAndWait` gets no (timely) response from the worker. That makes it so that the worker thread will never join and the content process never dies. Unfortunately, `SignalImpendingShutdownToContentJS` has effects only on the main thread's content JS execution, I think.
Comment 13•9 months ago
IIUC the loop causing `StringBlobImpl::CreateInputStream` is not endless but pretty long. IIUC the test case defines `a` as:

```js
a = new Blob([
  a,
  a,
  a,
  "�",
  "0ð¨¬0Ù©Û¹ÙªÙªó ¥¦Ù©/=â§Â¿*�",
  a,
  undefined,
  undefined,
  a,
  undefined,
  "���٪\n⣺ð¡¬ð®x鿦ê»",
  ">ã�",
  "�áð¯¯ð§ð¬âð¯¦ð¯ªð¢±ï¿½",
  "b㱿ó ¡ðë ç¢â®ð¸¡%",
  undefined,
  a,
  "Ùªó¿"
], { })
```
This results in a pseudo-recursion during blob creation on `BlobSet::AppendBlobImpl` first (ending up with a multipart blob with `length == 149297`), which then causes a promise chain of equal length to flood the worker thread queue for each `postMessage`. That seems to be a simple DoS attack on the worker.
So while there is potential to improve the shutdown reaction for DOM File in the content process, the underlying DoS is happening on Blob.
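For reference, a minimal reduction of that construction pattern might look like the sketch below; the loop count, the number of copies per round, and the worker script name are all made up, chosen only so the part count lands in roughly the same order of magnitude as the `length == 149297` mentioned above:

```js
// Each round re-wraps several copies of the previous blob (plus a few small
// strings) into a new multipart blob, so the number of parts grows
// geometrically while the total byte size stays small.
let a = new Blob(["x"]);
for (let round = 0; round < 7; round++) {
  a = new Blob([a, a, a, "y", a, a, "z"]); // 5 blob copies per round
}
// Handing such a blob to a worker then forces the receiving side to walk a
// very long chain of sub-blob parts for a single message.
const worker = new Worker("worker.js"); // hypothetical worker script
worker.postMessage(a);
```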
Comment 14•9 months ago
> So while there is potential to improve the shutdown reaction for DOM File in the content process
We currently have a shutdown blocker in the content process that

- is defined for `FileSystemWritableFileStream` only
- is associated with `XpcomWillShutdown`, which is too late to help with anything
- is active in DEBUG builds only, anyway

and thus has no real function in normal operations.
The better moment to react would be the "content-child-will-shutdown" notification (which is kind of equivalent to the `AppShutdown` phase of the parent process), but for which there is no `AsyncShutdown` barrier defined (yet). I think it would be good to make `FileSystemManagerChild` listen to that notification and do the equivalent of `RecvCloseAll` when it happens (always, not only in debug), rather than having the current shutdown blocker.
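As a rough chrome-JS analogue of that idea (sketch only: the real change would live in the C++ `FileSystemManagerChild`, `closeAllOpenHandles()` is a hypothetical stand-in for the RecvCloseAll-equivalent cleanup, and a privileged context with `Services` available is assumed):

```js
// Listen for the child-process shutdown notification instead of waiting for
// the (too late, debug-only) XpcomWillShutdown blocker.
const TOPIC = "content-child-will-shutdown";
const observer = {
  observe(subject, topic, data) {
    if (topic === TOPIC) {
      closeAllOpenHandles(); // hypothetical: the equivalent of RecvCloseAll
      Services.obs.removeObserver(observer, TOPIC);
    }
  },
};
Services.obs.addObserver(observer, TOPIC);
```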
Comment 15•9 months ago
We do use WorkerRefs in the worker case. XpcomWillShutdown is used only for the main thread.
Comment 16•9 months ago
What I want to say in comment 14 is that if we want to do something at shutdown in the child process, we should do so earlier than `XpcomWillShutdown` and not only in debug builds. Given that the quota manager shutdown triggers a `SendCloseAll`, it seemed reasonable to me to do the same course of action on "content-child-will-shutdown", independently from the worker shutdown.
That said, the root cause of the hang here is of course the blocked worker.
Comment 17•9 months ago
Bug 1874800 seems to confirm that we'd hang later even if the quota manager did not hang here. Even a file-system-specific handler for the `killActorsTimerCallback` would not change that, nor would the proposed changes from comment 14.
Still, I think there is value in making quota manager / file system shutdown more resilient in general; at least the real culprit for the hang would then jump out immediately. Unfortunately, in this situation anything that ultimately needs to happen on the blocked worker thread of the content process would not help.
What probably helps us in release instead is `self->KillHard("ShutDownKill");`, which should kick in when a content process takes more than 20 seconds to shut down. But we should verify that this really unblocks the parent's shutdown and does not leave any clutter in our bookkeeping (of quota manager or workers). I assume running the test case on an opt build would be enough to confirm.
Comment 18•9 months ago
> I assume running the test case on an opt build would be enough to confirm.
Running with an opt build just exits fast, approx. 2 secs. Maybe it is optimized enough to not hang at all then.
Comment 19•7 months ago
I do not think we can do too much here. Not sure if fuzzing again (obtaining even bigger multipart blobs) with an opt build is an option, in order to just confirm:
> What probably helps us in release instead is `self->KillHard("ShutDownKill");`, which should kick in when a content process takes more than 20 seconds to shut down. But we should verify that this really unblocks the parent's shutdown and does not leave any clutter in our bookkeeping (of quota manager or workers). I assume running the test case on an opt build would be enough to confirm.
Comment 20•6 months ago
If this is not a fuzz blocker, I'd close this as WONTFIX.
Reporter
Comment 21•6 months ago
Nope it's not a fuzzblocker.
Comment 22•6 months ago
No valid actions for resolution (WONTFIX).
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.