Closed Bug 1830443 Opened 1 year ago Closed 5 months ago

Assertion failure: !globalScopeSentinel->IsAlive(), at /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2224

Categories

(Core :: DOM: Workers, defect, P3)

defect

Tracking


RESOLVED FIXED
127 Branch
Tracking Status
firefox-esr115 --- wontfix
firefox114 --- wontfix
firefox115 --- wontfix
firefox118 --- wontfix
firefox119 --- wontfix
firefox120 --- wontfix
firefox125 --- wontfix
firefox126 --- wontfix
firefox127 --- fixed

People

(Reporter: tsmith, Assigned: edenchuang)

References

(Blocks 1 open bug)

Details

(Keywords: assertion, pernosco, testcase, Whiteboard: [bugmon:bisected,confirmed])

Attachments

(1 file)

1.20 KB, application/x-zip-compressed

Found while fuzzing m-c 20230413-19cc7f9b40f7 (--enable-debug --enable-fuzzing)

A test case is not available. A Pernosco session is available here: https://pernos.co/debug/jTwdp2FqodzE0Xpl0idg5Q/index.html

Assertion failure: !globalScopeSentinel->IsAlive(), at /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2224

#0 0x58f93cd8 in mozilla::dom::workerinternals::(anonymous namespace)::WorkerThreadPrimaryRunnable::Run() /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2224:7
#1 0x50edb711 in nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1233:16
#2 0x50ee3b05 in NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:479:10
#3 0x5246712f in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:300:20
#4 0x522c9057 in MessageLoop::RunInternal() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:369:10
#5 0x522c8fd4 in MessageLoop::RunHandler() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:362:3
#6 0x522c8f8f in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:344:3
#7 0x50ed5640 in nsThread::ThreadFunc(void*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:391:10
#8 0x6fff53e5 in _pt_root /builds/worker/checkouts/gecko/nsprpub/pr/src/pthreads/ptthread.c:201:5
#9 0x5ddb35b76608 in start_thread /build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c:477:8
#10 0x6831f132 in __clone /build/glibc-SzIz7B/glibc-2.31/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95
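
For context, a minimal, self-contained C++ sketch of the sentinel pattern this assertion expresses. The types and names below are hypothetical stand-ins (using std::weak_ptr), not the actual RuntimeService code: a weak sentinel observes the worker global, and after the final GC/CC at thread shutdown the global must be dead, so a still-alive sentinel means something is keeping the global alive.

#include <cassert>
#include <memory>

// Hypothetical stand-in for the worker global scope.
struct WorkerGlobalScope {};

int main() {
  auto globalScope = std::make_shared<WorkerGlobalScope>();

  // Weak "sentinel": observes the global without keeping it alive.
  std::weak_ptr<WorkerGlobalScope> globalScopeSentinel = globalScope;

  // At thread shutdown the last strong reference should be gone (in Gecko,
  // the final GC/CC is expected to have collected the worker global by now).
  globalScope.reset();

  // Rough analogue of MOZ_ASSERT(!globalScopeSentinel->IsAlive()): if a
  // hidden strong reference still existed, the sentinel would still be
  // "alive" and this would fire.
  assert(globalScopeSentinel.expired());
  return 0;
}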

I put some notes in that session. There might be something related to DOMRectReadOnly, but without the underlying JS it is hard to guess the flow. Were you able to reduce this a bit, and/or can you provide the JS as-is?

Severity: -- → S3
Priority: -- → P3
Flags: needinfo?(twsmith)

OK, TIL that NSCAP_RELEASE(this, mRawPtr); does not set mRawPtr to nullptr. That means that inspecting memory for pointer values gives false positives that are hard to detect. So the investigation in the pernosco session has not revealed any hot path so far.

Just for completeness and the record: in normal builds we fill freed memory with poison values, so the mRawPtr value would also have been overwritten. But fuzzing builds apparently are also ASan builds, and those do not use jemalloc and thus do no poisoning. So seeing those pointers uncleared is expected and not concerning at all (except for the confusion it causes when looking at them in such a pernosco session). Thanks to :mccr8 and :jesup for pointing me there.

Note that this still does not mean we made any progress with the investigation itself; I just learned something.
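
For illustration, a minimal, self-contained C++ sketch of the effect described above. The types and names are hypothetical stand-ins, not the real nsCOMPtr/NSCAP_RELEASE code: releasing a reference does not null the stored raw pointer, so the stale address stays in the slot unless the allocator poisons the freed memory.

#include <cstdio>

// Toy types; names are hypothetical.
struct Refcounted {
  int mRefCnt = 1;
  void Release() {
    if (--mRefCnt == 0) {
      delete this;
    }
  }
};

struct COMPtrLike {
  Refcounted* mRawPtr = nullptr;
  ~COMPtrLike() {
    if (mRawPtr) {
      // Like NSCAP_RELEASE(this, mRawPtr): releases the reference but
      // leaves mRawPtr holding the old address.
      mRawPtr->Release();
    }
  }
};

struct Holder {
  COMPtrLike mMember;
};

int main() {
  Refcounted* object = new Refcounted();

  Holder* holder = new Holder();
  holder->mMember.mRawPtr = object;

  // Destroying the holder runs ~COMPtrLike(): the reference is released,
  // but the mRawPtr slot inside the now-freed holder still contains the
  // address of `object`. With jemalloc that freed memory would be poisoned;
  // an ASan (non-jemalloc) fuzzing build leaves the stale value in place,
  // so scanning memory for pointers to `object` yields false positives.
  delete holder;

  std::printf("done\n");
  return 0;
}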

Attached file testcase.zip

Success!

Flags: needinfo?(twsmith)

Verified bug as reproducible on mozilla-central 20230519115028-225c5ab0d999.
Unable to bisect testcase (Testcase reproduces on start build!):

Start: e1d1107d438bbdad13a5c4f62911295ac8a16fcf (20220521094723)
End: 19cc7f9b40f7a8534e00f9abb411738836a9c9f9 (20230413035039)
BuildFlags: BuildFlags(asan=False, tsan=False, debug=True, fuzzing=True, coverage=False, valgrind=False, no_opt=False, fuzzilli=False, nyx=False)

Whiteboard: [bugmon:bisected,confirmed]
Flags: needinfo?(jstutte)

I won't get to this right now.

Flags: needinfo?(jstutte)

Hi Tyson, could we get a pernosco session based on the reduced test case in the meantime? Thanks a lot!

Flags: needinfo?(twsmith)

Sure, bugmon should be able to handle this.

Flags: needinfo?(twsmith)

Successfully recorded a pernosco session. A link to the pernosco session will be added here shortly.

A pernosco session for this bug can be found here.

Bugmon was unable to reproduce this issue.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Keywords: bugmon

A change to the Taskcluster build definitions over the weekend caused Bugmon to fail when reproducing issues. This issue has been corrected. Re-enabling bugmon.

Keywords: bugmon
Summary: Assertion failure: !globalScopeSentinel->IsAlive(), at /builds/worker/checkouts/gecko/dom/workers/Ru ntimeService.cpp:2224 → Assertion failure: !globalScopeSentinel->IsAlive(), at /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2224

I can actually reproduce this crash reliably with the following steps and a debug build on macOS (M1).

Steps:

  1. Run mach build to create an artifact build
  2. Run the command: MOZ_PROFILER_STARTUP=1 MOZ_PROFILER_SHUTDOWN=profile.json mach run
  3. Wait until Firefox has started, then after a few seconds click the profiler button in the toolbar to stop profiling
  4. Wait until the profiler UI has opened and the profile is shown
  5. Wait a bit further for the crash - if nothing happens try to work with the profiler UI until the crash appears

Note that after starting Firefox once, you definitely have to run mach build again before starting it again; otherwise some caching might prevent the crash from happening.

Jens, does that help? Maybe you are able to reproduce it now as well?

Flags: needinfo?(jstutte)

I was able to reproduce it on Windows this way, but I am not sure if it really helps me. I probably need to instrument the code a bit to see something.

(In reply to Bugmon [:jkratzer for issues] from comment #10)

A pernosco session for this bug can be found here.

In the meantime I also commented the older pernosco session a bit. In the "normal case", a worker global seems to be unlinked by the cycle collector in the repeatGCCC loop after calling UnrootGlobalScopes(), as expected - see the first five entries in the notebook.

In the failing case this does not happen. Interestingly, during CC shutdown we move our global pointer into a smart pointer in CycleCollectedJSRuntime::DeferredFinalize, as if we wanted to destroy it later, and we even destroy that nsCOMPtr later, but our refcount goes to 1 rather than 0. In other words: CC handling seems to be fine and works as expected, and there is apparently a non-CC-managed owning reference somewhere else. Or, even worse, maybe just a manual AddRef that does not even leave traces of the pointer in memory, given that I did not find anything suspicious by inspecting memory in gdb. Not sure where to go from here...
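
To make the suspected scenario concrete, a minimal sketch with hypothetical names (not Gecko code): the CC-managed reference is released, but a hidden, non-CC-managed AddRef keeps the refcount at 1 instead of 0, so the global never dies and the sentinel assertion at thread shutdown fires.

#include <cstdio>

// Toy refcounted "global"; names are hypothetical.
struct WorkerGlobalLike {
  int mRefCnt = 0;
  void AddRef() { ++mRefCnt; }
  void Release() {
    if (--mRefCnt == 0) {
      std::puts("global destroyed");
      delete this;
    }
  }
};

int main() {
  auto* global = new WorkerGlobalLike();
  global->AddRef();  // reference held via the CC-managed smart pointer
  global->AddRef();  // hidden, non-CC-managed owning reference (the suspected bug)

  // CC shutdown / DeferredFinalize releases the managed reference ...
  global->Release();

  // ... but the count lands at 1 instead of 0: "global destroyed" never
  // prints, the object outlives shutdown, and the sentinel still reports
  // the global as alive.
  std::printf("refcount after CC shutdown: %d\n", global->mRefCnt);
  return 0;
}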

Flags: needinfo?(jstutte)

Testcase crashes using the initial build (mozilla-central 20230429092024-8339bdf8fcc8) but not with tip (mozilla-central 20240426214429-c77d9ee9ea34).

The bug appears to have been fixed in the following build range:

Start: 7a398ae80184ee13fbf609dd765b5bc9e1601951 (20240422164302)
End: 9a6af72177a39b6fdbecaebe01b85610b4e9d108 (20240422181549)
Pushlog: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=7a398ae80184ee13fbf609dd765b5bc9e1601951&tochange=9a6af72177a39b6fdbecaebe01b85610b4e9d108

tsmith, can you confirm that the above bisection range is responsible for fixing this issue?
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Flags: needinfo?(twsmith)
Keywords: bugmon

I suspect bug 1875528 or bug 1724083 might have helped, or at least they changed some destruction related things, IIUC. Maybe :nika can tell?

Flags: needinfo?(nika)

I could believe that those changes may have improved some kind of buggy situation around re-entrant destruction of IPDL actors or similar on worker threads. However, I don't understand this bug right now, so I can't say with any confidence what would have changed, as I don't know how we ended up in this situation in the first place.

There's also a chance that the change just made the specific reproduction steps not work by keeping alive some object slightly longer, and that the underlying bug is still present, but I can't say for sure.

Flags: needinfo?(nika)

I assume that the changes to the worker lifecycle we made recently (like bug 1769913) could have improved something here, too, though the bisection suggests otherwise. Maybe Tyson can confirm whether this testcase still reproduces after bug 1769913, but if the fuzzers are now happy I do not really see a path forward here other than closing the bug.

Yes it looks much better from the fuzzing perspective. Thank you!

Status: NEW → RESOLVED
Closed: 5 months ago
Flags: needinfo?(twsmith)
Resolution: --- → FIXED
Assignee: nobody → echuang
Depends on: 1769913
Target Milestone: --- → 127 Branch