Assertion failure: !globalScopeSentinel->IsAlive(), at /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2224
Categories: Core :: DOM: Workers, defect, P3
People: Reporter: tsmith, Assigned: edenchuang
References: Blocks 1 open bug
Details: Keywords: assertion, pernosco, testcase; Whiteboard: [bugmon:bisected,confirmed]
Attachments: 1 file (1.20 KB, application/x-zip-compressed)
Found while fuzzing m-c 20230413-19cc7f9b40f7 (--enable-debug --enable-fuzzing)
A test case is not available. A Pernosco session is available here: https://pernos.co/debug/jTwdp2FqodzE0Xpl0idg5Q/index.html
Assertion failure: !globalScopeSentinel->IsAlive(), at /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2224
#0 0x58f93cd8 in mozilla::dom::workerinternals::(anonymous namespace)::WorkerThreadPrimaryRunnable::Run() /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2224:7
#1 0x50edb711 in nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1233:16
#2 0x50ee3b05 in NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:479:10
#3 0x5246712f in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:300:20
#4 0x522c9057 in MessageLoop::RunInternal() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:369:10
#5 0x522c8fd4 in MessageLoop::RunHandler() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:362:3
#6 0x522c8f8f in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:344:3
#7 0x50ed5640 in nsThread::ThreadFunc(void*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:391:10
#8 0x6fff53e5 in _pt_root /builds/worker/checkouts/gecko/nsprpub/pr/src/pthreads/ptthread.c:201:5
#9 0x5ddb35b76608 in start_thread /build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c:477:8
#10 0x6831f132 in __clone /build/glibc-SzIz7B/glibc-2.31/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Comment 1•1 year ago
I put some notes in that session. There might be something related to DOMRectReadOnly, but without the underlying JS it is hard to guess the flow. Were you able to reduce this a bit, and/or can you provide the JS as is?
Comment 2•1 year ago
OK, TIL that NSCAP_RELEASE(this, mRawPtr); does not set mRawPtr to nullptr. That means that inspecting memory for pointer values gives false positives that are hard to detect. So the investigation in the Pernosco session has not revealed any hot path yet.
Comment 3•1 year ago
Just for completeness and for the record: in normal builds we do fill freed memory with poison values, so mRawPtr would also have been overwritten. But fuzzing builds seem to also be ASan builds, and those do not use jemalloc and thus do no poisoning. So seeing those pointers uncleared is expected and not concerning at all (except for the confusion it causes when looking at them in such a Pernosco session). Thanks to :mccr8 and :jesup for pointing me there.
Note that this still does not mean we made any progress with the investigation itself; I just learned something.
Comment 5•1 year ago
Verified bug as reproducible on mozilla-central 20230519115028-225c5ab0d999.
Unable to bisect testcase (Testcase reproduces on start build!):
Start: e1d1107d438bbdad13a5c4f62911295ac8a16fcf (20220521094723)
End: 19cc7f9b40f7a8534e00f9abb411738836a9c9f9 (20230413035039)
BuildFlags: BuildFlags(asan=False, tsan=False, debug=True, fuzzing=True, coverage=False, valgrind=False, no_opt=False, fuzzilli=False, nyx=False)
Comment 7•1 year ago
Hi Tyson, in the meantime, could we have a Pernosco session based on the reduced test case? Thanks a lot!
Comment 9•1 year ago
Successfully recorded a pernosco session. A link to the pernosco session will be added here shortly.
Comment 10•1 year ago
A pernosco session for this bug can be found here.
Comment 11•1 year ago
Bugmon was unable to reproduce this issue.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.
Comment 12•1 year ago
A change to the Taskcluster build definitions over the weekend caused Bugmon to fail when reproducing issues. This issue has been corrected. Re-enabling bugmon.
Comment 13•9 months ago
I can reliably reproduce this crash with the following steps and a debug build on macOS (M1).
Steps:
- Run mach build to create an artifact build.
- Run the command: MOZ_PROFILER_STARTUP=1 MOZ_PROFILER_SHUTDOWN=profile.json mach run
- Wait until Firefox has started and, after some seconds, click the profiler button in the toolbar to stop profiling.
- Wait until the profiler UI has opened and the profile is shown.
- Wait a bit longer for the crash; if nothing happens, try working with the profiler UI until the crash appears.
Note that once you have started Firefox, you definitely have to run mach build again before starting Firefox again. Maybe some caching prevents the crash from happening when that is not done.
Jens, does that help? Maybe you are able to reproduce it now as well?
Comment 14•9 months ago
I was able to reproduce it on Windows this way, but I am not sure if it really helps me. I probably need to instrument the code a bit to see something.
(In reply to Bugmon [:jkratzer for issues] from comment #10)
A pernosco session for this bug can be found here.
In the meantime I also annotated the older Pernosco session a bit. In the "normal case", a worker global seems to be unlinked by the cycle collector in the repeatGCCC loop after the call to UnrootGlobalScopes(), as expected (see the first 5 entries in the notebook).
In the failing case this does not happen. Interestingly, though, during CC shutdown we move our global pointer into a smart pointer in CycleCollectedJSRuntime::DeferredFinalize, as if we wanted to destroy it later, and we do later destroy that nsCOMPtr, but our refcount drops to 1 rather than to 0. In other words, CC handling seems to be fine and works as expected, and there is apparently a non-CC-managed owning reference somewhere else. Or, even worse, maybe just a manual AddRef that does not even leave a trace of the pointer in memory, given that I did not find anything suspicious by inspecting memory in gdb. Not sure where to go from here...
Comment 15•5 months ago
Testcase crashes using the initial build (mozilla-central 20230429092024-8339bdf8fcc8) but not with tip (mozilla-central 20240426214429-c77d9ee9ea34).
The bug appears to have been fixed in the following build range:
Start: 7a398ae80184ee13fbf609dd765b5bc9e1601951 (20240422164302)
End: 9a6af72177a39b6fdbecaebe01b85610b4e9d108 (20240422181549)
Pushlog: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=7a398ae80184ee13fbf609dd765b5bc9e1601951&tochange=9a6af72177a39b6fdbecaebe01b85610b4e9d108
tsmith, can you confirm that the above bisection range is responsible for fixing this issue?
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.
Comment 16•5 months ago
I suspect bug 1875528 or bug 1724083 might have helped, or at least they changed some destruction-related things, IIUC. Maybe :nika can tell?
Comment 17•5 months ago
I could believe that those changes improved some kind of buggy situation around re-entrant destruction of IPDL actors or similar on worker threads. However, I don't understand this bug right now, so I can't say with any confidence what would have changed, as I don't know how we ended up in this situation in the first place.
There's also a chance that the change merely stopped the specific reproduction steps from working by keeping some object alive slightly longer, and that the underlying bug is still present, but I can't say for sure.
Comment 18•5 months ago
I assume that the recent changes to the worker lifecycle (like bug 1769913) could have improved something here, too, though the bisection points elsewhere. Maybe Tyson can double-check whether this testcase still reproduces after bug 1769913, but if the fuzzers are now happy, I do not really see a path forward other than closing the bug.
Comment 19•5 months ago (Reporter)
Yes it looks much better from the fuzzing perspective. Thank you!