Open Bug 1874800 Opened 8 months ago Updated 6 days ago

hanging at shutdown: Content process hangs on ShutdownPhase::XPCOMShutdownThreads and is killed by ChildLaxReaper timeout

Categories

(Core :: DOM: Workers, defect, P3)

defect

Tracking

()

Tracking Status
firefox123 --- affected

People

(Reporter: tsmith, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: hang, pernosco, testcase, Whiteboard: [bugmon:bisected,confirmed])

Attachments

(1 file)

Attached file testcase.html

Found while fuzzing m-c 20240114-39d188918af5 (--enable-debug --enable-fuzzing)

To reproduce via Grizzly Replay:

$ pip install fuzzfetch grizzly-framework
$ python -m fuzzfetch -d --fuzzing -n firefox
$ python -m grizzly.replay.bugzilla ./firefox/firefox <bugid> --relaunch 1

stderr:

[Parent 367548, IPC I/O Parent] WARNING: Process 367730 may be hanging at shutdown; will wait for up to 8000ms: file /builds/worker/checkouts/gecko/ipc/chromium/src/chrome/common/process_watcher_posix_sigchld.cc:184
[Parent 367548, IPC I/O Parent] WARNING: Process 367730 hanging at shutdown; attempting crash report (fatal error).: file /builds/worker/checkouts/gecko/ipc/chromium/src/chrome/common/process_watcher_posix_sigchld.cc:207
UndefinedBehaviorSanitizer:DEADLYSIGNAL
==367730==ERROR: UndefinedBehaviorSanitizer: ABRT on unknown address 0x03e800059bbc (pc 0x7fdbb3bfa117 bp 0x563e6e5b0748 sp 0x7fff3aa541d0 T367730)
    #0 0x7fdbb3bfa117 in __futex_abstimed_wait_common64 nptl/futex-internal.c:57:12
    #1 0x7fdbb3bfa117 in __futex_abstimed_wait_common nptl/futex-internal.c:87:9
    #2 0x7fdbb3bfa117 in __GI___futex_abstimed_wait_cancelable64 nptl/futex-internal.c:139:10
    #3 0x7fdbb3bfca40 in __pthread_cond_wait_common nptl/pthread_cond_wait.c:503:10
    #4 0x7fdbb3bfca40 in pthread_cond_wait nptl/pthread_cond_wait.c:627:10
    #5 0x563e6d7189fb in wait /builds/worker/checkouts/gecko/mozglue/misc/ConditionVariable_posix.cpp:106:11
    #6 0x563e6d7189fb in mozilla::detail::ConditionVariableImpl::wait_for(mozilla::detail::MutexImpl&, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&) /builds/worker/checkouts/gecko/mozglue/misc/ConditionVariable_posix.cpp:113:5
    #7 0x7fdb9ebdb5c7 in mozilla::OffTheBooksCondVar::Wait(mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator>) /builds/worker/checkouts/gecko/xpcom/threads/BlockingResourceBase.cpp:534:20
    #8 0x7fdb9ebe5f34 in Wait /builds/worker/checkouts/gecko/xpcom/threads/BlockingResourceBase.cpp:514:21
    #9 0x7fdb9ebe5f34 in mozilla::TaskController::GetRunnableForMTTask(bool) /builds/worker/checkouts/gecko/xpcom/threads/TaskController.cpp:619:19
    #10 0x7fdb9ec0a545 in nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1134:38
    #11 0x7fdb9ec0f5d0 in NS_ProcessNextEvent /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:480:10
    #12 0x7fdb9ec0f5d0 in SpinEventLoopUntil<(mozilla::ProcessFailureBehavior)1, (lambda at /builds/worker/checkouts/gecko/xpcom/threads/nsThreadManager.cpp:378:39)> /builds/worker/workspace/obj-build/dist/include/mozilla/SpinEventLoopUntil.h:176:25
    #13 0x7fdb9ec0f5d0 in nsThreadManager::ShutdownNonMainThreads() /builds/worker/checkouts/gecko/xpcom/threads/nsThreadManager.cpp:377:3
    #14 0x7fdb9ec5026b in mozilla::ShutdownXPCOM(nsIServiceManager*) /builds/worker/checkouts/gecko/xpcom/build/XPCOMInit.cpp:629:28
    #15 0x7fdba603598c in XRE_InitChildProcess(int, char**, XREChildData const*) /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:660:16
    #16 0x563e6d6bb3b6 in content_process_main /builds/worker/checkouts/gecko/browser/app/../../ipc/contentproc/plugin-container.cpp:57:28
    #17 0x563e6d6bb3b6 in main /builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp:375:18
    #18 0x7fdbb3b92d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #19 0x7fdbb3b92e3f in __libc_start_main csu/../csu/libc-start.c:392:3
    #20 0x563e6d6910e8 in _start (/home/worker/builds/m-c-20240114093125-fuzzing-debug/firefox-bin+0x590e8) (BuildId: 3c44943d507779f38c31adcf9a3c2a1d450f5497)
Flags: in-testsuite?

The testcase is almost identical with bug 1872960 except for the missing self.navigator.storage.getDirectory() (thus no involvement of file system / quota manager) which confirms the suspect that if quota manager would not have hanged on that bug we'd hang later on worker shutdown, anyways.

To recap shortly what happens (but I'll ask for a pernosco session to confirm 100%):

  • The blob definition a = new Blob([a, a, a, "�", a, ...], {}) adds a level of recursion to the generated blob on each onmessage, copying the blob from the previous call 6 times as piece into the new multipart blob.
  • As a result, on each post message the newly generated blob contains 6 times the previous one, growing thus exponentially.
  • This results in very long running runnables on the worker thread that block the worker's shutdown.

This seems to be mostly a DoS scenario that shows that badly written JS can hang us before it causes an OOM (which would be the ultimate reaction, I think). Unless this is a fuzz blocker I am not sure if we should spend too much time resolving this.

FWIW, it seems the fuzzer found a way for workers to do accidentally what bug 1844129 wants to do purposefully.

Severity: -- → S3
Keywords: pernosco-wanted
Priority: -- → P3
See Also: → 1872960, 1844129

I don't think bugmon will be able to get a Pernosco session for this because of the timing aspect. That said I was only able to successfully reproduce under rr with a -O2 build. Hopefully this has enough detail.

https://pernos.co/debug/wo-tyfJfPqGNdZ4XVwjsJw/index.html

(In reply to Tyson Smith [:tsmith] from comment #2)

I don't think bugmon will be able to get a Pernosco session for this because of the timing aspect.

That is true, the duration of the long lasting loop depends heavily on the execution speed, the fuzzer would need to evaluate different blob initializations in order to hit it again.

That said I was only able to successfully reproduce under rr with a -O2 build. Hopefully this has enough detail.

https://pernos.co/debug/wo-tyfJfPqGNdZ4XVwjsJw/index.html

Thank you for your time! This opt build shows a very similar behavior, but then we do not hang the parent process on shutdown but force kill the content process as we are not a debug build such that self->KillHard("ShutDownKill"); kicks in when a content process takes more than 20 secs to shutdown.

From bug 1872960 comment 17:

But we should verify that this unblocks the parent's shutdown really and does not leave any clutter in our book keeping (of quota manager or workers). I assume running the test case on an opt build would be enough to confirm.

Your session confirms exactly this. Thanks again!

(In reply to Tyson Smith [:tsmith] from comment #2)

I was only able to successfully reproduce under rr with a -O2 build. Hopefully this has enough detail.

Just to clarify, this was a debug build built -O2.

(In reply to Tyson Smith [:tsmith] from comment #4)

(In reply to Tyson Smith [:tsmith] from comment #2)

I was only able to successfully reproduce under rr with a -O2 build. Hopefully this has enough detail.

Just to clarify, this was a debug build built -O2.

Ah. I think I need to spend more time with that session then.

Verified bug as reproducible on mozilla-central 20240116163721-fd640ad054ea.
Unable to bisect testcase (Testcase reproduces on start build!):

Start: 60b4965aa0ca5a7a60c71229600092a65df8bc1d (20230118095131)
End: 39d188918af5021ffb4edb21a55d55006591bad0 (20240114093125)
BuildFlags: BuildFlags(asan=False, tsan=False, debug=True, fuzzing=True, coverage=False, valgrind=False, no_opt=False, fuzzilli=False, nyx=False)

Whiteboard: [bugmon:bisected,confirmed]

(In reply to Tyson Smith [:tsmith] from comment #4)

(In reply to Tyson Smith [:tsmith] from comment #2)

I was only able to successfully reproduce under rr with a -O2 build. Hopefully this has enough detail.

Just to clarify, this was a debug build built -O2.

OK, the SIGABRT I see happening on the blocked content process comes from ChildLaxReaper::CrashProcessIfHanging and happens after a hardcoded 8000 ms when the parent's IPC IO thread ends. That means the parent's shutdown sequence was not blocked before getting to delete sIOThread; and this timer kicked in.

Apparently what happens is that the content process is able to acknowledge and unblock its shutdown on the main thread but later hangs waiting for the blocked worker thread on ShutdownPhase::XPCOMShutdownThreads. So it makes totally sense that the ChildLaxReaper timeout kicks in and not the ContentParent::ForceKillTimerCallback.

Looking at comment 0, that is exactly what the log was always saying here:

[Parent 367548, IPC I/O Parent] WARNING: Process 367730 may be hanging at shutdown; will wait for up to 8000ms: file /builds/worker/checkouts/gecko/ipc/chromium/src/chrome/common/process_watcher_posix_sigchld.cc:184
[Parent 367548, IPC I/O Parent] WARNING: Process 367730 hanging at shutdown; attempting crash report (fatal error).: file /builds/worker/checkouts/gecko/ipc/chromium/src/chrome/common/process_watcher_posix_sigchld.cc:207
UndefinedBehaviorSanitizer:DEADLYSIGNAL

It might be interesting to look a bit more into this session how come we unblocked the worker shutdown blocker of the parent and proceeded happily with our shutdown on the parent side. There might be a WorkerRef missing somewhere? Edit: Probably not, we actually want the worker to end.

Summary: hanging at shutdown → hanging at shutdown: Content process hangs on ShutdownPhase::XPCOMShutdownThreads and is killed by ChildLaxReaper timeout

While we can learn something about our shutdown sequence, I think there is no tangible user impact here.

Severity: S3 → S4

Testcase crashes using the initial build (mozilla-central 20240114093125-39d188918af5) but not with tip (mozilla-central 20240809212700-7f7cc1db0cad.)

The bug appears to have been fixed in the following build range:

Start: 7832bc78b5a700a810046d3d04eca252269e626a (20240710133220)
End: ad8fc2890e760cd3934808ca9243b27190a8b308 (20240710135343)
Pushlog: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=7832bc78b5a700a810046d3d04eca252269e626a&tochange=ad8fc2890e760cd3934808ca9243b27190a8b308

tsmith, can you confirm that the above bisection range is responsible for fixing this issue?
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Flags: needinfo?(twsmith)
Keywords: bugmon

This is still reproducible with m-c 20240909-50016ed27344.

Flags: needinfo?(twsmith)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: