Closed Bug 1941852 Opened 28 days ago Closed 18 days ago

Assertion failure: sInServoTraversal || NS_IsMainThread(), at /builds/worker/workspace/obj-build/dist/include/mozilla/ServoUtils.h:33

Categories

(Core :: DOM: Workers, defect, P3)

x86_64
Linux
defect

Tracking

()

VERIFIED FIXED
136 Branch
Tracking Status
firefox-esr115 --- unaffected
firefox-esr128 --- unaffected
firefox134 --- wontfix
firefox135 --- verified
firefox136 --- verified

People

(Reporter: jkratzer, Assigned: asuth)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: pernosco, regression, testcase, Whiteboard: [bugmon:bisected,confirmed])

Crash Data

Attachments

(4 files)

Testcase found while fuzzing mozilla-central rev 5904a2d552f2 (built with: --enable-debug --enable-fuzzing).

Testcase can be reproduced using the following commands:

$ pip install fuzzfetch grizzly-framework --upgrade
$ python -m fuzzfetch --build 5904a2d552f2 --debug --fuzzing  -n firefox
$ python -m grizzly.replay.bugzilla ./firefox/firefox <bugid>
Assertion failure: sInServoTraversal || NS_IsMainThread(), at /builds/worker/workspace/obj-build/dist/include/mozilla/ServoUtils.h:33

    ==579159==ERROR: UndefinedBehaviorSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f5c5e337717 bp 0x7f5c4b55a330 sp 0x7f5c4b55a330 T579215)
    ==579159==The signal is caused by a WRITE memory access.
    ==579159==Hint: address points to the zero page.
        #0 0x7f5c5e337717 in IsInServoTraversal /builds/worker/workspace/obj-build/dist/include/mozilla/ServoUtils.h:33:3
        #1 0x7f5c5e337717 in IsInServoTraversal /builds/worker/workspace/obj-build/dist/include/mozilla/ServoStyleSet.h:119:45
        #2 0x7f5c5e337717 in gfxFontUtils::IsInServoTraversal() /gfx/thebes/gfxFontUtils.cpp:1764:21
        #3 0x7f5c6295429d in mozilla::dom::FontFace::~FontFace() /layout/style/FontFace.cpp:80:3
        #4 0x7f5c6298f95c in DeleteCycleCollectable /layout/style/FontFace.cpp:71:1
        #5 0x7f5c6298f95c in mozilla::dom::FontFace::cycleCollection::DeleteCycleCollectable(void*) /builds/worker/workspace/obj-build/dist/include/mozilla/dom/FontFace.h:43:3
        #6 0x7f5c5cbb38b5 in SuspectAfterShutdown(void*, nsCycleCollectionParticipant*, nsCycleCollectingAutoRefCnt*, bool*) /xpcom/base/nsCycleCollector.cpp:4003:12
        #7 0x7f5c629541d1 in decr /builds/worker/workspace/obj-build/dist/include/nsISupportsImpl.h:291:7
        #8 0x7f5c629541d1 in decr /builds/worker/workspace/obj-build/dist/include/nsISupportsImpl.h:279:12
        #9 0x7f5c629541d1 in mozilla::dom::FontFace::Release() /layout/style/FontFace.cpp:71:1
        #10 0x7f5c5cb9a982 in Release /builds/worker/workspace/obj-build/dist/include/mozilla/RefPtr.h:49:40
        #11 0x7f5c5cb9a982 in ~nsCOMPtr /builds/worker/workspace/obj-build/dist/include/nsCOMPtr.h:344:7
        #12 0x7f5c5cb9a982 in ~SegmentImpl /builds/worker/workspace/obj-build/dist/include/mozilla/SegmentedVector.h:78:21
        #13 0x7f5c5cb9a982 in mozilla::SegmentedVector<nsCOMPtr<nsISupports>, 4096ul, mozilla::MallocAllocPolicy>::PopLastN(unsigned int) /builds/worker/workspace/obj-build/dist/include/mozilla/SegmentedVector.h:253:14
        #14 0x7f5c5cb83302 in mozilla::dom::DeferredFinalizerImpl<nsISupports>::DeferredFinalize(unsigned int, void*) /builds/worker/workspace/obj-build/dist/include/mozilla/dom/BindingUtils.h:2919:15
        #15 0x7f5c5cb8391a in mozilla::IncrementalFinalizeRunnable::ReleaseNow(bool) /xpcom/base/CycleCollectedJSRuntime.cpp:1689:17
        #16 0x7f5c5cb83f71 in mozilla::CycleCollectedJSRuntime::FinalizeDeferredThings(mozilla::CycleCollectedJSRuntime::DeferredFinalizeType) /xpcom/base/CycleCollectedJSRuntime.cpp:1768:24
        #17 0x7f5c5cb81a9f in mozilla::CycleCollectedJSRuntime::OnGC(JSContext*, JSGCStatus, JS::GCReason) /xpcom/base/CycleCollectedJSRuntime.cpp:1855:7
        #18 0x7f5c6416c121 in js::gc::GCRuntime::maybeCallGCCallback(JSGCStatus, JS::GCReason) /js/src/gc/GC.cpp:4406:3
        #19 0x7f5c6416d1f3 in ~AutoCallGCCallbacks /js/src/gc/GC.cpp:4379:32
        #20 0x7f5c6416d1f3 in js::gc::GCRuntime::gcCycle(bool, JS::SliceBudget const&, JS::GCReason) /js/src/gc/GC.cpp:4507:1
        #21 0x7f5c6416e983 in js::gc::GCRuntime::collect(bool, JS::SliceBudget const&, JS::GCReason) /js/src/gc/GC.cpp:4690:9
        #22 0x7f5c64153839 in js::gc::GCRuntime::gc(JS::GCOptions, JS::GCReason) /js/src/gc/GC.cpp:4769:3
        #23 0x7f5c63af0da3 in JSRuntime::destroyRuntime() /js/src/vm/Runtime.cpp:254:8
        #24 0x7f5c63996faf in js::DestroyContext(JSContext*) /js/src/vm/JSContext.cpp:227:7
        #25 0x7f5c5cb7571a in mozilla::CycleCollectedJSContext::~CycleCollectedJSContext() /xpcom/base/CycleCollectedJSContext.cpp:106:3
        #26 0x7f5c6210713c in ~WorkerJSContext /dom/workers/RuntimeService.cpp:904:3
        #27 0x7f5c6210713c in operator() /builds/worker/workspace/obj-build/dist/include/mozilla/UniquePtr.h:460:5
        #28 0x7f5c6210713c in reset /builds/worker/workspace/obj-build/dist/include/mozilla/UniquePtr.h:302:7
        #29 0x7f5c6210713c in ~UniquePtr /builds/worker/workspace/obj-build/dist/include/mozilla/UniquePtr.h:250:18
        #30 0x7f5c6210713c in mozilla::dom::workerinternals::(anonymous namespace)::WorkerThreadPrimaryRunnable::Run() /dom/workers/RuntimeService.cpp:2245:5
        #31 0x7f5c5cca675c in nsThread::ProcessNextEvent(bool, bool*) /xpcom/threads/nsThread.cpp:1153:16
        #32 0x7f5c5ccad24f in NS_ProcessNextEvent(nsIThread*, bool) /xpcom/threads/nsThreadUtils.cpp:480:10
        #33 0x7f5c5d85a788 in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) /ipc/glue/MessagePump.cpp:299:20
        #34 0x7f5c5d7a9fe1 in RunHandler /ipc/chromium/src/base/message_loop.cc:362:3
        #35 0x7f5c5d7a9fe1 in MessageLoop::Run() /ipc/chromium/src/base/message_loop.cc:344:3
        #36 0x7f5c5cca1ca3 in nsThread::ThreadFunc(void*) /xpcom/threads/nsThread.cpp:366:10
        #37 0x7f5c6ce1c9df in _pt_root /nsprpub/pr/src/pthreads/ptthread.c:191:3
        #38 0x7f5c6cec1a93 in start_thread nptl/pthread_create.c:447:8
        #39 0x7f5c6cf4ec3b in clone3 misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
    
    ==579159==Register values:
    rax = 0x00007f5c58d3b7c2  rbx = 0x00007f5b900824b0  rcx = 0x000063dbb6bf6a20  rdx = 0x00007f5c6d029563  
    rdi = 0x00007f5c6d02a700  rsi = 0x0000000000000000  rbp = 0x00007f5c4b55a330  rsp = 0x00007f5c4b55a330  
     r8 = 0x0000000000000000   r9 = 0x0000000000000003  r10 = 0x0000000000000000  r11 = 0x0000000000000293  
    r12 = 0x0000000000000003  r13 = 0x00007f5b900696a8  r14 = 0x00007f5b900824b0  r15 = 0x00007f5b900824f8  
    UndefinedBehaviorSanitizer can not provide additional info.
    SUMMARY: UndefinedBehaviorSanitizer: SEGV /builds/worker/workspace/obj-build/dist/include/mozilla/ServoUtils.h:33:3 in IsInServoTraversal
    ==579159==ABORTING
Attached file Testcase
Blocks: gfx-triage

Verified bug as reproducible on mozilla-central 20250115215720-f7524feb52aa.
The bug appears to have been introduced in the following build range:

Start: 2b86d382dab2e182131315e79fa26bb32affa790 (20241024013316)
End: 7936ca01a900402531a01de23474744fde9bbe1c (20241024040951)
Pushlog: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=2b86d382dab2e182131315e79fa26bb32affa790&tochange=7936ca01a900402531a01de23474744fde9bbe1c

Keywords: regression
Whiteboard: [bugmon:confirm] → [bugmon:bisected,confirmed]

Emilio, do you have thoughts on this? In triage we're not sure this is graphics.

Flags: needinfo?(emilio)

So this is a FontFace getting destroyed in a worker. We're asserting that we're not in the servo traversal, but dom::GetCurrentThreadWorkerPrivate seems to be returning null now here.

So the good thing is that it seems "bening", in the sense that it's only breaking an assertion by accident (since we're presumably still in a worker thread).

Andrew, it seems likely one of your worker changes in comment 2 caused this, any thoughts?

Flags: needinfo?(emilio) → needinfo?(bugmail)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #4)

Andrew, it seems likely one of your worker changes in comment 2 caused this, any thoughts?

So, bug 1131324 exposed navigator.serviceWorker which is what the o[1] through o[3] are touching. Without bug 1131324 the test-case will be doing new WeakRef(undefined) and it seems likely this is significant. I suspect/hope this test case would reproduce prior to my landing if it were to use navigator.locks instead of navigator.serviceWorker, since that has been around for a longer time.

What's bizarre is that the FontFace is being destroyed as a fresh deferred finalize (it's not like a IncrementalFinalizeRunnable got wedged). The reference made it through the GC+CC cleanup gambit where we unroot the global scope, run at least one full shutdown GC and spin the event loop, and then call nsCycleCollector_shutdown() and spin the loop, and only have the teardown of the WorkerJSContext be where the finalizer runnable is called at the last moment (and efforts to get the WorkerPrivate from TLS will fail).

This is definitely a troublesome scenario, as dom::GetCurrentThreadWorkerPrivate is correctly returning null, but no code should be put in the position where it's being freed so late that it has to deal with that.

That said, one non-bizarre-GC possibility is that something wacky with multi-threaded ownership (mozilla::dom::FontFaceSetImpl::DispatchToOwningThread does exist) is running a runnable incredibly late on the worker thread and that is dropping the last strong reference to the FontFace on the worker which then interacts with the WeakRef somehow? But for a race like that, I would expect that there would have been a whole family of other assertions/crashes for the cases where the runnable doesn't make it to the worker in time.

I'm going to set the pernosco-wanted flag because if pernosco can reproduce this, it would be very nice. If we can't get a pernosco repro, getting a log from around the crash point with MOZ_LOG set to WorkerThread:5 and possibly WorkerPrivate:5 too could be quite interesting and help us rule out FontFace multithreading lifetime issues. (If we can rule those out, then we should move this to the Workers component.)

I'm also going to set a needinfo on :smaug for his CC expertise, I may be missing something obvious here.

Flags: needinfo?(bugmail) → needinfo?(smaug)
Keywords: pernosco-wanted

Successfully recorded a pernosco session. A link to the pernosco session will be added here shortly.

Not sure what the CC related question here is. We're deleting an object very very late, after CC has been shut down for the thread.

Flags: needinfo?(smaug)

A pernosco session for this bug can be found here.

The model how WorkerRefs are used seems a bit suspicious, but haven't yet looked at it more closely.
https://searchfox.org/mozilla-central/rev/5ef39aaccb8f660fb9c9c98ee76e0985ab6398d2/layout/style/FontFaceSetWorkerImpl.cpp#39-49
https://searchfox.org/mozilla-central/rev/5ef39aaccb8f660fb9c9c98ee76e0985ab6398d2/layout/style/FontFaceSetWorkerImpl.cpp#136,139

If I comment out that mWorkerRef = nullptr; I can't reproduce the crash anymore.

This bug has been marked as a regression. Setting status flag for Nightly to affected.

This isn't a regression. The test case is just over-constrained to depend on something that was first exposed in bug 1131324 but with high probability if the test case used navigator.locks the same thing would happen.

Keywords: regression
No longer regressed by: 1131324

Andrew, do you think there is a more appropriate component for this than Graphics? The team isn't really feeling that it's something on our plate.

Flags: needinfo?(bugmail)

Indeed, the pernosco trace is showing that the refcount is getting dropped to 1 at a reasonable time during DisconnectGlobalTeardownObservers but we're not finalizing the FontFace until we destroy the runtime and so then we don't perform the final release until slightly later in the process of destroying the runtime. Which is weird given that the worker runtime is being very classy and calling nsCycleCollector_shutdown.

This is somewhere in the interaction of Workers, the Cycle Collector, or the JS engine, so I'm going to move this to Workers for now.

:smaug, :mccr8, do you have ideas on why we're not sweeping the FontFace during nsCycleCollector_shutdown?

Component: Graphics → DOM: Workers
Flags: needinfo?(smaug)
Flags: needinfo?(continuation)
Flags: needinfo?(bugmail)

So a definite problem is that WorkerNavigator is not nulling out mServiceWorkerContainer in mozilla::dom::WorkerNavigator::Invalidate or tracing it, which is definitely my bad and definitely a regression. This means the WeakRef gets to stay alive and I guess thanks to the complexities of GC (or limitations of), it probably results in keeping the (presumably closed over) array siblings alive too late. If we fix the failure to aggressively null out the strong ref and the CC tracing, presumably the problem will go away.

Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Flags: needinfo?(smaug)
Flags: needinfo?(continuation)
Keywords: regression
Regressions: 1131324

For posterity, here's a screenshot of a portion of a class diagram of WorkerNavigator made by searchfox that shows that all the other fields are cc-traversed. (They are indirectly unlinked, however, which is why the almost documented emoji don't show the unlink emoji sequence. Searchfox doesn't currently pierce the indirect unlink logic.) (The diagram is semi-ironic because I added the functionality to help avoid situations like this, although searchfox doesn't run against pull requests.)

Severity: -- → S3
Priority: -- → P3
Regressed by: 1131324
No longer regressions: 1131324
Pushed by bugmail@asutherland.org: https://hg.mozilla.org/integration/autoland/rev/66b1ea756834 Properly traverse and unlink ServiceWorkerContainer. r=smaug,dom-worker-reviewers
Attachment #9461620 - Flags: approval-mozilla-beta?

beta Uplift Approval Request

  • User impact if declined: Reduces chance for us seeing weird worker crashes or memory leaks in the wild. More importantly, if we do see memory leaks or weird crashes in the wild and this was uplifted, we can know it's not from this bug. If we don't uplift and see weird stuff, we have to worry about this bug. The fuzzer scenario itself I think is hopefully somewhat unlikely, but if sites turn on more ServiceWorkers for Firefox, we could see interesting variants.
  • Code covered by automated testing: yes
  • Fix verified in Nightly: no
  • Needs manual QE test: no
  • Steps to reproduce for manual QE testing: n/a
  • Risk associated with taking this patch: Very low
  • Explanation of risk level: This moves cleanup from a sketchy/too-late time to the correct, very safe time.
  • String changes made/needed: none
  • Is Android affected?: yes
Status: ASSIGNED → RESOLVED
Closed: 18 days ago
Resolution: --- → FIXED
Target Milestone: --- → 136 Branch
Attachment #9461620 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Verified bug as fixed on rev mozilla-central 20250125093830-ef11a9be8e75.
Removing bugmon keyword as no further action possible. Please review the bug and re-add the keyword for further analysis.

Status: RESOLVED → VERIFIED
Keywords: bugmon
Duplicate of this bug: 1942924

Copying crash signatures from duplicate bugs.

Crash Signature: [@ get]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: