Open Bug 1865569 Opened 2 years ago Updated 2 days ago

Crash in [@ arena_t::MallocSmall | arena_t::Malloc | BaseAllocator::malloc | MozJemalloc::malloc]

Categories

(Core :: Memory Allocator, defect)

x86_64
Unspecified
defect

Tracking

()

Tracking Status
firefox122 --- affected

People

(Reporter: release-mgmt-account-bot, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash, topcrash)

Crash Data

Crash report: https://crash-stats.mozilla.org/report/index/c9cddd00-2acb-4ee8-a4ec-6a9790231104

MOZ_CRASH Reason: MOZ_DIAGNOSTIC_ASSERT(run->mMagic == 0x384adf93)

Top 10 frames of crashing thread:

0  firefox-bin  arena_t::MallocSmall  memory/build/mozjemalloc.cpp:3296
0  firefox-bin  arena_t::Malloc  memory/build/mozjemalloc.cpp:3344
0  firefox-bin  BaseAllocator::malloc  memory/build/mozjemalloc.cpp:4564
0  firefox-bin  MozJemalloc::malloc  memory/build/malloc_decls.h:51
0  firefox-bin  PageMalloc  memory/build/PHC.cpp:1309
0  firefox-bin  MozJemallocPHC::malloc  memory/build/PHC.cpp:1313
0  firefox-bin  ReplaceMalloc::malloc  memory/build/malloc_decls.h:51
0  firefox-bin  malloc  memory/build/malloc_decls.h:51
0  firefox-bin  moz_xmalloc  memory/mozalloc/mozalloc.cpp:52
1  libxul.so  operator new  memory/mozalloc/cxxalloc.h:33

By querying Nightly crashes reported within the last 2 months, here are some insights about the signature:

  • First crash report: 2023-10-25
  • Process type: Multiple distinct types
  • Is startup crash: No
  • Has user comments: No
  • Is null crash: Yes - 3 out of 4 crashes happened on null or near null memory address
Component: General → Memory Allocator

How does this make sense? To get a crash address of 0x0 like in the linked crash report, in the test "run->mMagic == ARENA_RUN_MAGIC", you'd need run to be null. Except three lines above we have this:

if (MOZ_UNLIKELY(!run)) {
    return nullptr;
}

IOW, a null run should have returned.

Duh, the crash address comes from MOZ_DIAGNOSTIC_ASSERT itself and is not relevant. Since this is happening during allocation, the run is one the allocator picked itself, so this is not a case of a pointer that doesn't actually point to a run. What this means is that some other code most likely wrote over the magic number, e.g. via a buffer overflow...
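To illustrate the overflow hypothesis above, here is a minimal sketch of how a write that runs past a neighbouring allocation can clobber a magic number stored right after it. This is not mozjemalloc code: `RunHeader`, `Memory`, `MakeMemory`, `WriteToNeighbour`, `MagicIntact` and the 16-byte neighbour size are hypothetical names and values made up for the example. Only the magic value 0x384adf93 and the field names `mMagic` and `mNumFree` come from the crash reasons in this bug.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a run header. Only the two fields named in the
// crash reasons are modelled; the real layout is not shown in this bug.
struct RunHeader {
  uint32_t mMagic;
  uint32_t mNumFree;
};

constexpr uint32_t kRunMagic = 0x384adf93;  // value from the crash reason
constexpr size_t kNeighbourSize = 16;       // assumed size of the allocation before the header

// One flat byte buffer: [ 16-byte neighbour | RunHeader ].
using Memory = std::array<unsigned char, kNeighbourSize + sizeof(RunHeader)>;

// Build the buffer with an intact header placed right after the neighbour.
inline Memory MakeMemory() {
  Memory mem{};
  RunHeader header{kRunMagic, 4};
  std::memcpy(mem.data() + kNeighbourSize, &header, sizeof(header));
  return mem;
}

// Simulate a write of `len` bytes that starts at the neighbour. Anything past
// kNeighbourSize spills into the header. The write must stay inside the
// backing array so that the sketch itself remains well-defined.
inline void WriteToNeighbour(Memory& mem, size_t len) {
  assert(len <= mem.size());
  std::memset(mem.data(), 0x41, len);
}

// Equivalent of the "run->mMagic == 0x384adf93" test in the assertion.
inline bool MagicIntact(const Memory& mem) {
  RunHeader header;
  std::memcpy(&header, mem.data() + kNeighbourSize, sizeof(header));
  return header.mMagic == kRunMagic;
}
```

Writing 16 bytes with `WriteToNeighbour` leaves `MagicIntact` true. Writing 20 bytes overwrites `mMagic`, and the check only fails later, when the allocator next looks at that header, which would explain why the crashing stack is an innocent malloc() call.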

There are three different crash reasons under this signature, with the first one being by far the most common:

  • MOZ_RELEASE_ASSERT(mNode)
  • MOZ_DIAGNOSTIC_ASSERT(run->mMagic == 0x384adf93)
  • MOZ_DIAGNOSTIC_ASSERT(run->mNumFree > 0)

Cracking open minidumps might tell us what those values are, and if they're caused by bit-flips or a real problem. I'm NI?ing myself to do it when I have some free time.
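If a minidump does yield the observed value, one rough way to separate a bit-flip from real corruption is to count how many bits differ from the expected value. The sketch below is only a heuristic under the assumption that a hardware bit-flip changes exactly one bit; `DifferingBits` and `LooksLikeBitFlip` are hypothetical helper names, not existing Gecko functions.

```cpp
#include <cstdint>

// Count the bits that differ between the observed and the expected value.
constexpr int DifferingBits(uint32_t observed, uint32_t expected) {
  uint32_t diff = observed ^ expected;
  int count = 0;
  while (diff != 0) {
    diff &= diff - 1;  // clear the lowest set bit
    ++count;
  }
  return count;
}

// Assumed heuristic: exactly one differing bit suggests a bit-flip, while
// many differing bits suggest the value was overwritten by other data.
constexpr bool LooksLikeBitFlip(uint32_t observed, uint32_t expected) {
  return DifferingBits(observed, expected) == 1;
}
```

For example, an observed magic of 0x384adf92 differs from 0x384adf93 in a single bit and would be flagged as a likely bit-flip, whereas something like 0x41414141 would not.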
Flags: needinfo?(gsvelto)

No luck here, the crash happens within deeply inlined code so it's very hard to recover the values of the variables. I'll try to manually look at the disassembly and see if I can figure something out from that, but I make no promises.

Flags: needinfo?(gsvelto)
Severity: -- → S3

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 content process crashes on release

:pbone, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(pbone)
Keywords: topcrash

I was looking at these crashes with Jens and I noticed that several of the crashes have two threads in the memory allocator at the same time, see for example a5a84af4-a76f-400a-b39a-fe6ab0251210. The crashing thread is the IPC I/O Child thread doing a malloc() and the main thread is also doing a malloc() (but blocked on a lock). That's a pretty strong hint there might be a problem in the allocator itself.

Only looking at the majority of the volume here, these crashes come from Trend Micro users with MOZ_RELEASE_ASSERT(mNode) as a crash reason. This gets reflected in the loaded modules with the presence of Trend Micro DLLs.

Taking an example crash and disassembling at RIP shows:

0:001> u rip
mozglue!arena_t::MallocSmall+0x1c80 [/builds/worker/checkouts/gecko/memory/build/mozjemalloc.cpp @ 2746] [inlined in mozglue!moz_xmalloc+0x1dab [/builds/worker/checkouts/gecko/memory/mozalloc/mozalloc.cpp @ 52]]:
00007ffa`fdcb3ffb cc              int     3
00007ffa`fdcb3ffc b9b9000000      mov     ecx,0B9h
00007ffa`fdcb4001 e829450400      call    mozglue!MOZ_NoReturn (00007ffa`fdcf852f)

B9 is the line value for MOZ_NoReturn(line); at this call site, and so we are at line 185 in RedBlackTree.h, so in [@ RedBlackTree<T>::TreeNode::SetColor]. Hence, this is a variation of bug 1872261. I'm not sure why the inlining info seems confused here and the signature changed (perhaps this part is worth its own investigation), but the assembly code leaves no doubt about this fact.
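For readers who want to check the decoding by hand, the two instructions after the int 3 can be decoded as sketched below. This relies on the standard x86-64 encodings (`B9 imm32` for mov ecx, imm32 and `E8 rel32` for a near call relative to the next instruction). The helper names `ReadLE32`, `DecodeMovEcxImm32` and `DecodeCallTarget` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Read a little-endian 32-bit value from four bytes.
constexpr uint32_t ReadLE32(const uint8_t* p) {
  return uint32_t(p[0]) | (uint32_t(p[1]) << 8) | (uint32_t(p[2]) << 16) |
         (uint32_t(p[3]) << 24);
}

// Decode the immediate of "mov ecx, imm32" (opcode B9 followed by imm32).
inline uint32_t DecodeMovEcxImm32(const uint8_t* insn) {
  assert(insn[0] == 0xB9);
  return ReadLE32(insn + 1);
}

// Compute the target of "call rel32" (opcode E8 followed by rel32). The
// displacement is relative to the address of the next instruction, which is
// 5 bytes after the start of the call.
inline uint64_t DecodeCallTarget(uint64_t address, const uint8_t* insn) {
  assert(insn[0] == 0xE8);
  const int32_t rel = static_cast<int32_t>(ReadLE32(insn + 1));
  return address + 5 + static_cast<int64_t>(rel);
}
```

Feeding in the bytes b9 b9 00 00 00 gives an immediate of 0xB9, i.e. 185 in decimal, and e8 29 45 04 00 at address 00007ffa`fdcb4001 resolves to 00007ffa`fdcf852f, the MOZ_NoReturn address shown in the disassembly.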

So... Either we broke something in our blocklist code ourselves, or Trend Micro successfully pushed a bypass to our blocklist code without addressing the underlying issue that caused them to be blocked in the first place.

Component: Memory Allocator → Other
Depends on: 1872261
Flags: needinfo?(pbone)
OS: All → Windows
Product: Core → External Software Affecting Firefox
Hardware: x86 → x86_64
Summary: Crash in [@ arena_t::MallocSmall | arena_t::Malloc | BaseAllocator::malloc | MozJemalloc::malloc] → Crash in [@ arena_t::MallocSmall | arena_t::Malloc | BaseAllocator::malloc | MozJemalloc::malloc] with Trend Micro
Version: unspecified → Firefox 145
Severity: S3 → S2

Never mind, I'll file a new bug for the Trend Micro part since this bug was originally not about that. Sorry.

Severity: S2 → S3
Component: Other → Memory Allocator
No longer depends on: 1872261
OS: Windows → Unspecified
Product: External Software Affecting Firefox → Core
See Also: → 1872261
Summary: Crash in [@ arena_t::MallocSmall | arena_t::Malloc | BaseAllocator::malloc | MozJemalloc::malloc] with Trend Micro → Crash in [@ arena_t::MallocSmall | arena_t::Malloc | BaseAllocator::malloc | MozJemalloc::malloc]
Version: Firefox 145 → unspecified
See Also: → 2005777

(In reply to Yannis Juglaret [:yannis] from comment #7)

> Only looking at the majority of the volume here

:jstutte wanted more precise numbers about this: here they are. Over the last six months, we have received 1353 Firefox crashes on this specific signature. Out of those 1353, 1208 show the presence of a Trend Micro DLL, so 89% overall. But the crashes with Trend Micro DLLs only started in release 143.0 (which matches the discovery of bug 1872261). Interestingly, there are no crashes with Trend Micro DLLs in releases 144.0 (the first version with the uplifted patch from bug 1872261) and 144.0.2. But they are back in 145.0, 145.0.1 and 145.0.2. The prevalence is particularly high for 145.0.2, where we're at 721 out of 729, so 99% of crashes have Trend Micro DLLs.
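As a quick sanity check of the percentages quoted above, they can be recomputed from the raw counts. The helper `RoundedPercent` is a hypothetical name, and rounding to the nearest integer is an assumption about how the figures were obtained.

```cpp
#include <cassert>
#include <cstdint>

// Percentage of `part` in `total`, rounded to the nearest integer.
inline uint32_t RoundedPercent(uint32_t part, uint32_t total) {
  assert(total > 0 && part <= total);
  return (100u * part + total / 2) / total;
}
```

`RoundedPercent(1208, 1353)` evaluates to 89 and `RoundedPercent(721, 729)` to 99, matching the figures in the comment. The 729 total for 145.0.2 is the 721 crashes with Trend Micro DLLs plus the 8 crashes without them.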

Below are versions of Firefox for which we received crashes with Trend Micro DLLs, ordered by volume:

Total: 1208
145.0.2: 721
143.0.1: 188
145.0: 130
145.0.1: 113
143.0: 12
144.0b4: 7
143.0b4: 6
143.0b5: 5
143.0b9: 5
143.0b7: 4
143.0rc1: 3
143.0.3: 2
143.0b6: 2
144.0b2: 2
144.0b3: 2
145.0rc2: 2
143.0b3: 1
144.0b5: 1
145.0b1: 1
145.0b2: 1

And the same for crashes without Trend Micro DLLs:

Total: 145
128.13.0esr: 39
121.0.1: 14
145.0.1: 9
145.0.2: 8
143.0.1: 6
140.0.2: 5
142.0.1: 5
140.0.4: 4
141.0.3: 4
145.0: 4
128.14.0esr: 3
136.0b3: 3
140.4.0esr: 3
144.0.2: 3
127.0: 2
128.12.0esr: 2
139.0.4: 2
142.0: 2
143.0.4: 2
144.0b5: 2
146.0b0: 2
121.0: 1
127.0a1: 1
128.11.0esr: 1
128.5.1esr: 1
128.6.0esr: 1
128.7.0esr: 1
129.0.2: 1
130.0: 1
132.0.2: 1
135.0b1: 1
140.5.0esr: 1
140.6.0esr: 1
141.0.2: 1
142.0b3: 1
143.0a1: 1
143.0b0: 1
143.0b3: 1
144.0b1: 1
146.0: 1
146.0a1: 1
146.0b6: 1