Closed Bug 1837408 Opened 2 years ago Closed 2 years ago

Crash in [@ mozilla::LinkedListElement<T>::getNext]

Categories

(Core :: JavaScript Engine, defect, P1)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
116 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox114 + fixed
firefox115 --- fixed
firefox116 --- fixed

People

(Reporter: aryx, Assigned: sfink)

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(1 file)

This crash signature existed before but got more frequent with Firefox 114.0. All crashes are reported for Windows with ~40% in the first minute after launch. 32 crashes from 32 installations after 1 day of rollout to 25% of users (4% effective update rate).

Crash report: https://crash-stats.mozilla.org/report/index/0a2d49e9-3270-4759-8986-3350f0230608

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0  xul.dll  mozilla::LinkedListElement<js::jit::IonCompileTask>::getNext  mfbt/LinkedList.h:208
0  xul.dll  mozilla::LinkedList<js::jit::IonCompileTask>::getFirst  mfbt/LinkedList.h:531
0  xul.dll  CancelOffThreadIonCompileLocked  js/src/vm/HelperThreads.cpp:423
0  xul.dll  js::CancelOffThreadIonCompile  js/src/vm/HelperThreads.cpp:440
1  xul.dll  js::CancelOffThreadIonCompile  js/src/vm/HelperThreads.h:176
1  xul.dll  js::Nursery::discardJitCodeForZone  js/src/gc/Nursery.cpp:422
2  xul.dll  js::Nursery::updateAllocFlagsForZone  js/src/gc/Nursery.cpp:417
2  xul.dll  js::Nursery::updateAllZoneAllocFlags  js/src/gc/Nursery.cpp:403
2  xul.dll  js::Nursery::disable  js/src/gc/Nursery.cpp:371
3  xul.dll  js::gc::GCRuntime::finish  js/src/gc/GC.cpp:900

Will, could you please have your team look at this crash? This seems to be spiking with the release, thanks!

Flags: needinfo?(wmedina)

This is most likely caused by the change in bug 1830921, which means that we now cancel JIT compilation during shutdown. This should be a problem though because we already do this slightly earlier in JSRuntime::destroyRuntime.

Looking at CancelOffThreadIonCompileLocked it looks like the problem is null pointer deref here:

https://searchfox.org/mozilla-central/source/js/src/vm/HelperThreads.cpp#423-424

However I don't see that can happen as we already check whether jitRuntime is present before calling this function, and it not deleted until after this point in destroyRuntime.

As a quick fix for release we could skip discarding JIT code when disabling the nursery during shutdown, but that wouldn't fix the underlying issue.

Jan, do you have any thoughts on what could be going wrong here?

Flags: needinfo?(wmedina) → needinfo?(jdemooij)

(In reply to Jon Coppeard (:jonco) from comment #2)

Looking at CancelOffThreadIonCompileLocked it looks like the problem is null pointer deref here:

https://searchfox.org/mozilla-central/source/js/src/vm/HelperThreads.cpp#423-424

However I don't see that can happen as we already check whether jitRuntime is present before calling this function, and it not deleted until after this point in destroyRuntime.

There's another potential null deref in runtime->jitRuntime()->ionLazyLinkList(runtime).getFirst();. I think you're talking about the JitRuntime* being nullptr. The return value of ionLazyLinkList(runtime) is also dereferenced without an offset:

   0x000000000ddf8587 <+1415>:	call   0xede9ea0 <_ZN2js3jit10JitRuntime15ionLazyLinkListEP9JSRuntime>
   0x000000000ddf858c <+1420>:	mov    (%rax),%rbx
   0x000000000ddf858f <+1423>:	cmpb   $0x0,0x10(%rbx)  ; my guess is this is the mIsSentinel field

(note that this is from a mismatched Linux DEBUG binary). I don't see how that's possible either, and I haven't checked the binary to see which deref it is. (Nor do I know how to.) But this deref better matches the reported stack.

getFirst() is { return sentinel.getNext() }. getNext() is { return mNext->asT(); } where mNext is at offset 0. asT() is { return mIsSentinel ? ... }. mIsSentinel looks to be at offset 0x10. That makes %rbx a LinkedListElement<IonCompileTask>*, I guess? Or equivalently, just an IonCompileTask* given that it inherits from the former (with a zero-length earlier base type). So the mov (%rax), %rbx would be loading an IonCompileTask* out of a LinkedList<IonCompileTask>, and the latter has a sentinel at offset 0 that getFirst() calls getNext() on. Therefore the mov looks to be the call to .getFirst() on the return value of runtime->jitRuntime()->ionLazyLinkList(runtime).

But ionLazyLinkList just returns &jitRuntime->ionLazyLinkList_.ref(), and I don't see how that could evaluate to a nullptr regardless of what jitRuntime is? I'm probably reading something wrong. The whole LinkedList/LinkedListElement thing confuses me when mentally translating to assembly.

Maybe this helps, or maybe it's a distraction. I can't come up with a story where we could have torn something down resulting in the nullptr dereference here.

Oh, wait. I looked at the crash report again, and the nullptr deref is a lie. The crash address is 0x1e8, from the instruction mov r15, qword [rax + 0x1e8]. On my linux opt build, 0x1e0 is ionLazyLinkList_ and there's nothing else for 24 bytes. I could believe it might be 0x1e8 in the actual binary. This makes it look a lot more like a null JitRuntime* dereference. Ignore my speculation above.

I don't see anything that would ever make jitRuntime() return nullptr, though. Even in JSRuntime::destroyRuntime(), it just calls js_delete on it. It doesn't clear the field. The only time I see it being nullptr is before it is set in the first place, or if it fails to initialize in createJitRuntime().

I'm trying to look through the failure handling for eg https://searchfox.org/mozilla-central/source/dom/workers/RuntimeService.cpp#2127 but I have to run now. I guess I'm wondering if there's a way we could destroy a runtime that failed to initialize? There are two ScopeExit cleanup handlers that would fire in that case, but I don't think they do anything like that.

There's a call from within CycleCollectedJSContext::Initialize, but the jit runtime won't exist and JitDataStructuresExist checks for that.

How about this one:

#250622 = RuntimeService.cpp:uint32 mozilla::dom::workerinternals::{anonymous}::WorkerThreadPrimaryRunnable::Run()
#1622193 = mozilla::UniquePtr<T, D>::~UniquePtr() [with T = mozilla::dom::WorkerJSContext; D = mozilla::DefaultDelete<mozilla::dom::WorkerJSContext>]
#1623392 = mozilla::UniquePtr<T, D>::~UniquePtr() [with T = mozilla::dom::WorkerJSContext; D = mozilla::DefaultDelete<mozilla::dom::WorkerJSContext>] [[base_dtor]] []
#1623390 = void mozilla::UniquePtr<T, D>::reset(mozilla::UniquePtr<T, D>::Pointer) [with T = mozilla::dom::WorkerJSContext; D = mozilla::DefaultDelete<mozilla::dom::WorkerJSContext>; mozilla::UniquePtr<T, D>::Pointer = mozilla::dom::WorkerJSContext*]
#1624290 = void mozilla::DefaultDelete<T>::operator()(T*) const [with T = mozilla::dom::WorkerJSContext]
#250587 = void mozilla::dom::WorkerJSContext::~WorkerJSContext()
#250589 = _ZN7mozilla3dom15WorkerJSContextD1Ev [:]
#1621914 = void mozilla::dom::WorkerJSContext::~WorkerJSContext() [[base_dtor]] []
#1617107 = _ZN7mozilla23CycleCollectedJSContextD2Ev
#1171660 = void JS_DestroyContext(JSContext*)
#2128442 = void js::DestroyContext(JSContext*)
#2128441 = void JSRuntime::destroyRuntime()

(ignore the numbers). The story here is that we create the WorkerJSContext, then try to initialize here or here and fail. Somehow, bizarrely, we manage to create a JIT runtime before the failure. Then when context goes out of scope, it tears things down. The creation stack would be:

#250622 = RuntimeService.cpp:uint32 mozilla::dom::workerinternals::{anonymous}::WorkerThreadPrimaryRunnable::Run()
#1621887 = RuntimeService.cpp:uint8 mozilla::dom::workerinternals::{anonymous}::InitJSContextForWorker(mozilla::dom::WorkerPrivate*, JSContext*)
#1171654 = uint8 JS::InitSelfHostedCode(JSContext*, mozilla::Span<const unsigned char>, (uint8)(JSContext*,mozilla::Span<const unsigned char>)*)
#2141923 = uint8 JSRuntime::createJitRuntime(JSContext*)

(I have a weirder stack that could do it, but it seems pretty unlikely.)

Ok, so here's my hypothesis for someone to shoot down: we fail during initialization. The GC may or may not be initialized so we might skip the earlier opportunity to CancelOffThreadIonCompile, but there's no jit runtime and so it doesn't do anything anyway. We finish the GC, which now disables the nursery. And now because true == false, we somehow bypass the JitDataStructuresExist check and proceed onward to crash.

I will admit the true == false part needs some work. In this path, we are using the Zone* case, so apparently jitZone() is mysteriously returning true. That only requires getJitZone() to have been called because it will automatically create one. This is impossible according to an assert but we're in optimized crazytown here, so we don't care about asserts. [Actually, see below, it can be temporarily non-null here. The assert would pass.] Let's say createJitZone was called by:

#250622 = RuntimeService.cpp:uint32 mozilla::dom::workerinternals::{anonymous}::WorkerThreadPrimaryRunnable::Run()
#1621887 = RuntimeService.cpp:uint8 mozilla::dom::workerinternals::{anonymous}::InitJSContextForWorker(mozilla::dom::WorkerPrivate*, JSContext*)
#1171654 = uint8 JS::InitSelfHostedCode(JSContext*, mozilla::Span<const unsigned char>, (uint8)(JSContext*,mozilla::Span<const unsigned char>)*)
#2141923 = uint8 JSRuntime::createJitRuntime(JSContext*)
#2151812 = uint8 js::jit::JitRuntime::initialize(JSContext*)
#2061520 = uint8 js::jit::JitRuntime::generateBaselineICFallbackCode(JSContext*)
#2046891 = js::jit::JitCode* js::jit::Linker::newCode(JSContext*, uint8)
#1980318 = js::jit::JitZone* JS::Zone::getJitZone(JSContext*)
#1980320 = js::jit::JitZone* JS::Zone::createJitZone(JSContext*)

Oh! Wait, this may work. createJitRuntime sets the jitRuntime_ field then fails to initialize and clears it, then goes on to tear things down as described above. But during its failed attempt at initialization, it managed to set a Zone's jitZone_ to non-null.

Sorry, that's kind of stream of consciousness and I have to run now before I can clean it up, but I think this is all possible. jandem, what do you think?

The surgical fix would be to check for the jitRuntime in Nursery::updateAllocFlagsForZone before doing anything. But I don't know what the right fix is.

Assignee: nobody → sphink
Status: NEW → ASSIGNED
Assignee: nobody → sphink
Status: NEW → ASSIGNED

(In reply to Steve Fink [:sfink] [:s:] from comment #5)
Good catch!

From crash stats, uptime values are small and system memory use values are large, which backs up the idea we are hitting OOM soon after startup.

(In reply to Jon Coppeard (:jonco) from comment #7)

(In reply to Steve Fink [:sfink] [:s:] from comment #5)
Good catch!

From crash stats, uptime values are small and system memory use values are large, which backs up the idea we are hitting OOM soon after startup.

I'm assuming DOM Workers can be created well after startup, so uptime may not need to be small for this explanation to work.

I can reproduce the crash with

diff --git a/js/src/jit/BaselineJIT.cpp b/js/src/jit/BaselineJIT.cpp
--- a/js/src/jit/BaselineJIT.cpp
+++ b/js/src/jit/BaselineJIT.cpp
@@ -999,4 +999,7 @@ uint8_t* BaselineInterpreter::retAddrFor
 bool jit::GenerateBaselineInterpreter(JSContext* cx,
                                       BaselineInterpreter& interpreter) {
+  ReportOutOfMemory(cx);
+  return false;
+
   if (IsBaselineInterpreterEnabled()) {
     TempAllocator temp(&cx->tempLifoAlloc());

(GenerateBaselineInterpreter is just the last thing that can fail in initialization; I could probably move the failure up earlier.)

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 content process crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash
Attachment #9338208 - Attachment description: Bug 1837408 - clean up partially-initialized zones → Bug 1837408 - Skip atoms zone when updating nursery flags
Severity: -- → S2
Priority: -- → P1
Flags: needinfo?(jdemooij)
Pushed by sfink@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/abf8f424d2d0 Skip atoms zone when updating nursery flags r=jandem

Backed out for causing Sm bustages in PublicIterators.h

Flags: needinfo?(sphink)
Pushed by sfink@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/7183c524620f Skip atoms zone when updating nursery flags r=jandem
Flags: needinfo?(sphink)
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 116 Branch

The patch landed in nightly and beta is affected.
:sfink, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox115 to wontfix.

For more information, please visit BugBot documentation.

Flags: needinfo?(sphink)

Comment on attachment 9338208 [details]
Bug 1837408 - Skip atoms zone when updating nursery flags

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes when running out of memory. Usually happens during startup, though at least from the code I would expect it to be possible at other times (unless we prepopulate a pool of workers?)

Note that this only happens when we fail to initialize, and fixing this won't make the initialization work (it's failing by running out of memory when initializing a DOM worker), but at least the crash will be accounted in the OOM bucket and there might be cases where we can recover; I'm not sure.

  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: (I can reproduce by injecting a fake OOM by changing code, but there's nothing in the code that allows that now.)
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): On the failure path already. Reverting much closer to previous behavior, before the regression was introduced.
  • String changes made/needed: none
  • Is Android affected?: Yes
Flags: needinfo?(sphink)
Attachment #9338208 - Flags: approval-mozilla-release?
Attachment #9338208 - Flags: approval-mozilla-beta?

Comment on attachment 9338208 [details]
Bug 1837408 - Skip atoms zone when updating nursery flags

Approved for 115.0b7.

Attachment #9338208 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9338208 [details]
Bug 1837408 - Skip atoms zone when updating nursery flags

Approved for 114.0.2, thanks.

Attachment #9338208 - Flags: approval-mozilla-release? → approval-mozilla-release+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: