Closed Bug 1830443 Opened 1 year ago Closed 5 months ago

Assertion failure: !globalScopeSentinel->IsAlive(), at /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2224

Categories

(Core :: DOM: Workers, defect, P3)

defect

Tracking


RESOLVED FIXED
127 Branch
Tracking Status
firefox-esr115 --- wontfix
firefox114 --- wontfix
firefox115 --- wontfix
firefox118 --- wontfix
firefox119 --- wontfix
firefox120 --- wontfix
firefox125 --- wontfix
firefox126 --- wontfix
firefox127 --- fixed

People

(Reporter: tsmith, Assigned: edenchuang)

References

(Blocks 1 open bug)

Details

(Keywords: assertion, pernosco, testcase, Whiteboard: [bugmon:bisected,confirmed])

Attachments

(1 file)

1.20 KB, application/x-zip-compressed

Found while fuzzing m-c 20230413-19cc7f9b40f7 (--enable-debug --enable-fuzzing)

A test case is not available. A Pernosco session is available here: https://pernos.co/debug/jTwdp2FqodzE0Xpl0idg5Q/index.html

Assertion failure: !globalScopeSentinel->IsAlive(), at /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2224

#0 0x58f93cd8 in mozilla::dom::workerinternals::(anonymous namespace)::WorkerThreadPrimaryRunnable::Run() /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2224:7
#1 0x50edb711 in nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1233:16
#2 0x50ee3b05 in NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:479:10
#3 0x5246712f in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:300:20
#4 0x522c9057 in MessageLoop::RunInternal() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:369:10
#5 0x522c8fd4 in MessageLoop::RunHandler() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:362:3
#6 0x522c8f8f in MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:344:3
#7 0x50ed5640 in nsThread::ThreadFunc(void*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:391:10
#8 0x6fff53e5 in _pt_root /builds/worker/checkouts/gecko/nsprpub/pr/src/pthreads/ptthread.c:201:5
#9 0x5ddb35b76608 in start_thread /build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c:477:8
#10 0x6831f132 in __clone /build/glibc-SzIz7B/glibc-2.31/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95
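
For context, a minimal, self-contained C++ sketch of the sentinel pattern this assertion expresses. The types and names below are hypothetical stand-ins (using std::weak_ptr), not the actual RuntimeService code: a weak sentinel observes the worker global, and after the final GC/CC at thread shutdown the global must be dead, so a still-alive sentinel means something is keeping the global alive.

#include <cassert>
#include <memory>

// Hypothetical stand-in for the worker global scope.
struct WorkerGlobalScope {};

int main() {
  auto globalScope = std::make_shared<WorkerGlobalScope>();

  // Weak "sentinel": observes the global without keeping it alive.
  std::weak_ptr<WorkerGlobalScope> globalScopeSentinel = globalScope;

  // At thread shutdown the last strong reference should be gone (in Gecko,
  // the final GC/CC is expected to have collected the worker global by now).
  globalScope.reset();

  // Rough analogue of MOZ_ASSERT(!globalScopeSentinel->IsAlive()): if a
  // hidden strong reference still existed, the sentinel would still be
  // "alive" and this would fire.
  assert(globalScopeSentinel.expired());
  return 0;
}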

I put some notes in that session. There might be something related to DOMRectReadOnly, but without the underlying JS it is hard to guess the flow. Were you able to reduce this a bit, and/or can you provide the JS as-is?

Severity: -- → S3
Priority: -- → P3
Flags: needinfo?(twsmith)

OK, TIL that NSCAP_RELEASE(this, mRawPtr); does not set mRawPtr to nullptr. That means that inspecting memory for pointer values gives false positives that are hard to detect. So the investigation in the pernosco session has not revealed any hot path so far.

Just for completeness and the record: in normal builds we fill freed memory with poison values, so the mRawPtr value would also have been overwritten. But fuzzing builds apparently are also ASan builds, and those do not use jemalloc and thus do no poisoning. So seeing those pointers uncleared is expected and not concerning at all (except for the confusion it causes when looking at them in such a pernosco session). Thanks to :mccr8 and :jesup for pointing me there.

Note that this still does not mean we made any progress with the investigation itself; I just learned something.
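
For illustration, a minimal, self-contained C++ sketch of the effect described above. The types and names are hypothetical stand-ins, not the real nsCOMPtr/NSCAP_RELEASE code: releasing a reference does not null the stored raw pointer, so the stale address stays in the slot unless the allocator poisons the freed memory.

#include <cstdio>

// Toy types; names are hypothetical.
struct Refcounted {
  int mRefCnt = 1;
  void Release() {
    if (--mRefCnt == 0) {
      delete this;
    }
  }
};

struct COMPtrLike {
  Refcounted* mRawPtr = nullptr;
  ~COMPtrLike() {
    if (mRawPtr) {
      // Like NSCAP_RELEASE(this, mRawPtr): releases the reference but
      // leaves mRawPtr holding the old address.
      mRawPtr->Release();
    }
  }
};

struct Holder {
  COMPtrLike mMember;
};

int main() {
  Refcounted* object = new Refcounted();

  Holder* holder = new Holder();
  holder->mMember.mRawPtr = object;

  // Destroying the holder runs ~COMPtrLike(): the reference is released,
  // but the mRawPtr slot inside the now-freed holder still contains the
  // address of `object`. With jemalloc that freed memory would be poisoned;
  // an ASan (non-jemalloc) fuzzing build leaves the stale value in place,
  // so scanning memory for pointers to `object` yields false positives.
  delete holder;

  std::printf("done\n");
  return 0;
}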

Attached file testcase.zip

Success!

Flags: needinfo?(twsmith)

Verified bug as reproducible on mozilla-central 20230519115028-225c5ab0d999.
Unable to bisect testcase (Testcase reproduces on start build!):

Start: e1d1107d438bbdad13a5c4f62911295ac8a16fcf (20220521094723)
End: 19cc7f9b40f7a8534e00f9abb411738836a9c9f9 (20230413035039)
BuildFlags: BuildFlags(asan=False, tsan=False, debug=True, fuzzing=True, coverage=False, valgrind=False, no_opt=False, fuzzilli=False, nyx=False)

Whiteboard: [bugmon:bisected,confirmed]
Flags: needinfo?(jstutte)

I won't get to this right now.

Flags: needinfo?(jstutte)

Hi Tyson, could we get a pernosco session based on the reduced test case in the meantime? Thanks a lot!

Flags: needinfo?(twsmith)

Sure, bugmon should be able to handle this.

Flags: needinfo?(twsmith)

Successfully recorded a pernosco session. A link to the pernosco session will be added here shortly.

A pernosco session for this bug can be found here.

Bugmon was unable to reproduce this issue.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Keywords: bugmon

A change to the Taskcluster build definitions over the weekend caused Bugmon to fail when reproducing issues. This issue has been corrected. Re-enabling bugmon.

Keywords: bugmon
Summary: Assertion failure: !globalScopeSentinel->IsAlive(), at /builds/worker/checkouts/gecko/dom/workers/Ru ntimeService.cpp:2224 → Assertion failure: !globalScopeSentinel->IsAlive(), at /builds/worker/checkouts/gecko/dom/workers/RuntimeService.cpp:2224

I can actually reproduce this crash reliably with the following steps and a debug build on macOS (M1).

Steps:

  1. Run mach build to create an artifact build
  2. Run the command: MOZ_PROFILER_STARTUP=1 MOZ_PROFILER_SHUTDOWN=profile.json mach run
  3. Wait until Firefox has started, then after a few seconds click the profiler button in the toolbar to stop profiling
  4. Wait until the profiler UI has opened and the profile is shown
  5. Wait a bit further for the crash - if nothing happens try to work with the profiler UI until the crash appears

Note that after starting Firefox once, you definitely have to run mach build again before starting it again; otherwise some caching might prevent the crash from happening.

Jens, does that help? Maybe you are able to reproduce it now as well?

Flags: needinfo?(jstutte)

I was able to reproduce it on Windows this way, but I am not sure if it really helps me. I probably need to instrument the code a bit to see something.

(In reply to Bugmon [:jkratzer for issues] from comment #10)

A pernosco session for this bug can be found here.

In the meantime I also commented the older pernosco session a bit. In the "normal case", a worker global seems to be unlinked by the cycle collector in the repeatGCCC loop after calling UnrootGlobalScopes(), as expected - see the first five entries in the notebook.

In the failing case this does not happen. Interestingly, during CC shutdown we move our global pointer into a smart pointer in CycleCollectedJSRuntime::DeferredFinalize, as if we wanted to destroy it later, and we even destroy that nsCOMPtr later, but our refcount goes to 1 rather than 0. In other words: CC handling seems to be fine and works as expected, and there is apparently a non-CC-managed owning reference somewhere else. Or, even worse, maybe just a manual AddRef that does not even leave traces of the pointer in memory, given that I did not find anything suspicious by inspecting memory in gdb. Not sure where to go from here...
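
To make the suspected scenario concrete, a minimal sketch with hypothetical names (not Gecko code): the CC-managed reference is released, but a hidden, non-CC-managed AddRef keeps the refcount at 1 instead of 0, so the global never dies and the sentinel assertion at thread shutdown fires.

#include <cstdio>

// Toy refcounted "global"; names are hypothetical.
struct WorkerGlobalLike {
  int mRefCnt = 0;
  void AddRef() { ++mRefCnt; }
  void Release() {
    if (--mRefCnt == 0) {
      std::puts("global destroyed");
      delete this;
    }
  }
};

int main() {
  auto* global = new WorkerGlobalLike();
  global->AddRef();  // reference held via the CC-managed smart pointer
  global->AddRef();  // hidden, non-CC-managed owning reference (the suspected bug)

  // CC shutdown / DeferredFinalize releases the managed reference ...
  global->Release();

  // ... but the count lands at 1 instead of 0: "global destroyed" never
  // prints, the object outlives shutdown, and the sentinel still reports
  // the global as alive.
  std::printf("refcount after CC shutdown: %d\n", global->mRefCnt);
  return 0;
}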

Flags: needinfo?(jstutte)

Testcase crashes using the initial build (mozilla-central 20230429092024-8339bdf8fcc8) but not with tip (mozilla-central 20240426214429-c77d9ee9ea34).

The bug appears to have been fixed in the following build range:

Start: 7a398ae80184ee13fbf609dd765b5bc9e1601951 (20240422164302)
End: 9a6af72177a39b6fdbecaebe01b85610b4e9d108 (20240422181549)
Pushlog: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=7a398ae80184ee13fbf609dd765b5bc9e1601951&tochange=9a6af72177a39b6fdbecaebe01b85610b4e9d108

tsmith, can you confirm that the above bisection range is responsible for fixing this issue?
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Flags: needinfo?(twsmith)
Keywords: bugmon

I suspect bug 1875528 or bug 1724083 might have helped, or at least they changed some destruction related things, IIUC. Maybe :nika can tell?

Flags: needinfo?(nika)

I could believe that those changes may have improved some kind of buggy situation around re-entrant destruction of IPDL actors or similar on worker threads. However, I don't understand this bug right now, so I can't say with any confidence what would have changed, as I don't know how we ended up in this situation in the first place.

There's also a chance that the change just made the specific reproduction steps not work by keeping alive some object slightly longer, and that the underlying bug is still present, but I can't say for sure.

Flags: needinfo?(nika)

I assume that the changes to the worker lifecycle we made recently (like bug 1769913) could have improved something here, too, though the bisection suggests otherwise. Maybe Tyson can confirm whether this testcase still reproduces after bug 1769913, but if the fuzzers are now happy I do not really see a path forward here other than closing the bug.

Yes it looks much better from the fuzzing perspective. Thank you!

Status: NEW → RESOLVED
Closed: 5 months ago
Flags: needinfo?(twsmith)
Resolution: --- → FIXED
Assignee: nobody → echuang
Depends on: 1769913
Target Milestone: --- → 127 Branch