Crash in [@ memset | MaybePoison] due to a CPU bug in memset with some versions of VCRUNTIME140.dll
Categories
(Core :: mozglue, defect)
People
(Reporter: yannis, Unassigned)
(Whiteboard: [win:stability][tbird crash])
Crash Data
This crash signature is currently quite high volume in 115.0.2 release.
Example crash report: here.
Example call stack:
# Child-SP RetAddr Call Site
00 0000007d`c743f168 00007ffb`4888f40e VCRUNTIME140!memset+0x1f2 [D:\a\_work\1\s\src\vctools\crt\vcruntime\src\string\amd64\memset.asm @ 339]
01 (Inline Function) --------`-------- mozglue!MaybePoison+0xa [/builds/worker/checkouts/gecko/memory/build/mozjemalloc.cpp @ 1501]
02 (Inline Function) --------`-------- mozglue!arena_dalloc+0x4a [/builds/worker/checkouts/gecko/memory/build/mozjemalloc.cpp @ 3740]
03 (Inline Function) --------`-------- mozglue!BaseAllocator::free+0x67 [/builds/worker/checkouts/gecko/memory/build/mozjemalloc.cpp @ 4547]
04 (Inline Function) --------`-------- mozglue!Allocator<MozJemallocBase>::free+0x67 [/builds/worker/checkouts/gecko/memory/build/malloc_decls.h @ 54]
05 0000007d`c743f170 00007ffb`253e1646 mozglue!je_free+0x9e [/builds/worker/checkouts/gecko/memory/build/malloc_decls.h @ 54]
06 (Inline Function) --------`-------- xul!mozilla::DefaultDelete<IPC::Message>::operator()+0xbf [/builds/worker/workspace/obj-build/dist/include/mozilla/UniquePtr.h @ 459]
07 (Inline Function) --------`-------- xul!mozilla::UniquePtr<IPC::Message,mozilla::DefaultDelete<IPC::Message> >::reset+0xc8 [/builds/worker/workspace/obj-build/dist/include/mozilla/UniquePtr.h @ 301]
08 (Inline Function) --------`-------- xul!mozilla::UniquePtr<IPC::Message,mozilla::DefaultDelete<IPC::Message> >::~UniquePtr+0xc8 [/builds/worker/workspace/obj-build/dist/include/mozilla/UniquePtr.h @ 249]
09 (Inline Function) --------`-------- xul!IPC::Channel::ChannelImpl::OutputQueuePop+0x155 [/builds/worker/checkouts/gecko/ipc/chromium/src/chrome/common/ipc_channel_win.cc @ 107]
0a (Inline Function) --------`-------- xul!IPC::Channel::ChannelImpl::ProcessOutgoingMessages+0xaca [/builds/worker/checkouts/gecko/ipc/chromium/src/chrome/common/ipc_channel_win.cc @ 553]
0b 0000007d`c743f260 00007ffb`253ddf55 xul!IPC::Channel::ChannelImpl::OnIOCompleted+0xb36 [/builds/worker/checkouts/gecko/ipc/chromium/src/chrome/common/ipc_channel_win.cc @ 650]
0c (Inline Function) --------`-------- xul!base::MessagePumpForIO::WaitForIOCompletion+0x1ae [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_pump_win.cc @ 490]
0d 0000007d`c743f4b0 00007ffb`238e1a55 xul!base::MessagePumpForIO::DoRunLoop+0x285 [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_pump_win.cc @ 443]
0e (Inline Function) --------`-------- xul!base::MessagePumpWin::RunWithDispatcher+0x3d [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_pump_win.cc @ 59]
0f 0000007d`c743f670 00007ffb`24299e4f xul!base::MessagePumpWin::Run+0x55 [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_pump_win.h @ 79]
10 (Inline Function) --------`-------- xul!MessageLoop::RunInternal+0x16 [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc @ 368]
11 0000007d`c743f6d0 00007ffb`238e14a2 xul!MessageLoop::RunHandler+0x2f [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc @ 362]
12 (Inline Function) --------`-------- xul!MessageLoop::Run+0x46 [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc @ 343]
13 0000007d`c743f720 00007ffb`24296001 xul!base::Thread::ThreadMain+0x192 [/builds/worker/checkouts/gecko/ipc/chromium/src/base/thread.cc @ 187]
14 0000007d`c743f900 00007ffb`6b577614 xul!`anonymous namespace'::ThreadFunc+0x11 [/builds/worker/checkouts/gecko/ipc/chromium/src/base/platform_thread_win.cc @ 20]
15 0000007d`c743f930 00007ffb`4887c538 kernel32!BaseThreadInitThunk+0x14
16 (Inline Function) --------`-------- mozglue!mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>,void (*)(int, void *, void *)>::operator()+0x15 [/builds/worker/checkouts/gecko/toolkit/xre/dllservices/mozglue/nsWindowsDllInterceptor.h @ 150]
17 0000007d`c743f960 00007ffb`6d4226b1 mozglue!patched_BaseThreadInitThunk+0x28 [/builds/worker/checkouts/gecko/toolkit/xre/dllservices/mozglue/WindowsDllBlocklist.cpp @ 617]
18 0000007d`c743f9d0 00000000`00000000 ntdll!RtlUserThreadStart+0x21
We are crashing in memset from VCRUNTIME140.dll. Even though we ship this Microsoft DLL in the Firefox directory (currently in version 14.29.30139.0), if a system-wide version of the DLL is found in C:\Windows\System32, the system-wide version will be favored at load time. This explains why the DLL version (and code) is not the same for everyone.
Most crashes in 115.0.2 are with system-wide installed versions 14.32.31332.0 (36%), 14.36.32532.0 (13%), and 14.31.31103.0 (12%). The majority of crashes are with a specific CPU model (family 23 model 1 stepping 1), which suggests a CPU bug.
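For readers unfamiliar with the "family 23 model 1 stepping 1" notation: these three numbers are packed into the EAX value returned by CPUID leaf 1. The sketch below decodes such a value using the usual x86 display-family and display-model rules. The function name and the sample value 0x00800F11 are illustrative assumptions, not data taken from the crash reports.

```python
def decode_cpuid_signature(eax):
    """Decode a CPUID leaf 1 EAX value into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is saturated (0xF).
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model supplies the high nibble for base families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


# Hypothetical sample value chosen because it decodes to the triple
# discussed in this bug (family 23 is 0x17).
family, model, stepping = decode_cpuid_signature(0x00800F11)
print(f"family {family} model {model} stepping {stepping}")
```

Grouping crash reports by this decoded triple, rather than by marketing name, is what allows the CPU correlation above to be stated precisely.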
For the crashes in VCRUNTIME140.dll versions 14.32.31332.0 and 14.31.31103.0, the reported error is 0xc0000096 (privileged instruction). We crash because the instruction pointer is not aligned on the start of the intended instruction, so the CPU attempts to execute a different instruction from the intended one (which turns out to be privileged).
0:005> u rip
VCRUNTIME140!memset+0x1f2:
# The unintended instruction where we crash
00007ffb`5bf21b92 e701 out 1,eax
0:005> u rip-2
VCRUNTIME140!memset+0x1f0 [D:\a\_work\1\s\src\vctools\crt\vcruntime\src\string\amd64\memset.asm @ 339]:
# The larger instruction to which these bytes are supposed to belong
00007ffb`5bf21b90 c5fde701 vmovntdq ymmword ptr [rcx],ymm0
00007ffb`5bf21b94 c5fde74120 vmovntdq ymmword ptr [rcx+20h],ymm0
00007ffb`5bf21b99 c5fde74140 vmovntdq ymmword ptr [rcx+40h],ymm0
00007ffb`5bf21b9e c5fde74160 vmovntdq ymmword ptr [rcx+60h],ymm0
...
In VCRUNTIME140.dll version 14.36.32532.0, the situation is similar but with error code 0xc0000005 (access violation) and a different instruction:
0:001> u rip
VCRUNTIME140!memset+0x146 [D:\a\_work\1\s\src\vctools\crt\vcruntime\src\string\amd64\memset.asm @ 264]:
# The unintended instruction where we crash
00007ffa`501b1b06 8180000000c5fd7f81a0 add dword ptr [rax-3B000000h],0A0817FFDh
0:001> u rip-0x16
VCRUNTIME140!memset+0x130 [D:\a\_work\1\s\src\vctools\crt\vcruntime\src\string\amd64\memset.asm @ 260]:
00007ffa`501b1af0 c5fd7f01 vmovdqa ymmword ptr [rcx],ymm0
00007ffa`501b1af4 c5fd7f4120 vmovdqa ymmword ptr [rcx+20h],ymm0
00007ffa`501b1af9 c5fd7f4140 vmovdqa ymmword ptr [rcx+40h],ymm0
00007ffa`501b1afe c5fd7f4160 vmovdqa ymmword ptr [rcx+60h],ymm0
# The larger instruction to which these bytes are supposed to belong
00007ffa`501b1b03 c5fd7f8180000000 vmovdqa ymmword ptr [rcx+80h],ymm0
00007ffa`501b1b0b c5fd7f81a0000000 vmovdqa ymmword ptr [rcx+0A0h],ymm0
...
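The misalignment in both listings can be checked mechanically. The sketch below takes the addresses and instruction bytes from the two debugger listings above and reports which listed instruction contains the crashing RIP, how far into it RIP points, and which bytes remain from that point. The helper and variable names are made up for illustration.

```python
def enclosing_instruction(rip, listing):
    """Find the listed instruction whose byte range contains rip.

    Returns (start_address, offset_into_instruction, remaining_bytes_hex),
    or None if rip is not inside any listed instruction.
    """
    for start, raw_hex in listing:
        raw = bytes.fromhex(raw_hex)
        if start <= rip < start + len(raw):
            offset = rip - start
            return start, offset, raw[offset:].hex()
    return None


# Addresses and bytes copied from the vmovntdq listing (error 0xc0000096).
LISTING_VMOVNTDQ = [
    (0x00007FFB5BF21B90, "c5fde701"),
    (0x00007FFB5BF21B94, "c5fde74120"),
    (0x00007FFB5BF21B99, "c5fde74140"),
    (0x00007FFB5BF21B9E, "c5fde74160"),
]
CRASH_RIP_VMOVNTDQ = 0x00007FFB5BF21B92

# Addresses and bytes copied from the vmovdqa listing (error 0xc0000005).
LISTING_VMOVDQA = [
    (0x00007FFA501B1AF0, "c5fd7f01"),
    (0x00007FFA501B1AF4, "c5fd7f4120"),
    (0x00007FFA501B1AF9, "c5fd7f4140"),
    (0x00007FFA501B1AFE, "c5fd7f4160"),
    (0x00007FFA501B1B03, "c5fd7f8180000000"),
    (0x00007FFA501B1B0B, "c5fd7f81a0000000"),
]
CRASH_RIP_VMOVDQA = 0x00007FFA501B1B06

for rip, listing in [
    (CRASH_RIP_VMOVNTDQ, LISTING_VMOVNTDQ),
    (CRASH_RIP_VMOVDQA, LISTING_VMOVDQA),
]:
    start, offset, tail = enclosing_instruction(rip, listing)
    print(f"rip={rip:#x} is {offset} byte(s) into the instruction at {start:#x}; "
          f"remaining bytes: {tail}")
```

An offset of 0 would mean RIP sits on an instruction boundary. Here the offsets are 2 and 3: in the first case the leftover bytes `e7 01` are exactly the `out 1,eax` shown at the crash address, and in the second case the leftover bytes `81 80 00 00 00` combine with the start of the next instruction to form the `add` shown at the crash address.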
Comment 1•2 years ago
All crashes with this reason are on Windows 10/11. The spike also seems to have started right around the July patch Tuesday. I'm thinking faulty Windows update maybe?
Comment 2•2 years ago
Take a look at the graphs of the signature and aggregate by "reason". You will see that the most common issue (54% of the crashes with this signature over the last 6 months) is EXCEPTION_ACCESS_VIOLATION_WRITE. That signature ramps up over time starting in 111 and would be concerning all on its own. This is heap memory corruption of some kind: in the process of freeing allocated memory, we crash when trying to write the 0xe5e5e5e5 pattern on memory we think we still own, right before we free it. Number of crashes by major version:
115 394 33.56 %
114 324 27.60 %
113 237 20.19 %
112 115 9.80 %
111 91 7.75 %
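As a reminder of what the poisoning step mentioned above does, here is a toy model in which a bytearray stands in for the heap. It is not mozjemalloc's code; the function names and the bookkeeping dictionary are assumptions made for illustration. The point is that the poison write touches every byte of the block being freed, so if the allocator's idea of what it owns is wrong, this write is where the fault surfaces.

```python
POISON_BYTE = 0xE5  # repeated four times, this reads back as 0xe5e5e5e5


def poison_then_free(heap, live_blocks, offset):
    """Overwrite a block with the poison byte, then forget about it.

    heap is a bytearray standing in for process memory, and live_blocks maps
    block offsets to sizes (a stand-in for the allocator's metadata).
    """
    size = live_blocks.pop(offset)  # the allocator believes it owns this range
    heap[offset:offset + size] = bytes([POISON_BYTE]) * size  # the memset step
    return size


def read_u32(heap, offset):
    """Read a little-endian 32-bit value, as a stale pointer dereference would."""
    return int.from_bytes(heap[offset:offset + 4], "little")


heap = bytearray(32)
live_blocks = {8: 16}     # one live 16-byte block at offset 8
heap[8:24] = b"A" * 16    # some payload
poison_then_free(heap, live_blocks, 8)
print(hex(read_u32(heap, 8)))
```

In the real allocator the write goes through memset on actual memory, which is why the crashing frame is memset called from MaybePoison.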
Although the above signature is worst in 115.0.2, the big visual spike in this crash signature is because EXCEPTION_PRIV_INSTRUCTION gets added to the mix. That's only 15% of the crashes overall, but 82% of those crashes (264) happen in 115.0.2 specifically, the only 115 version that's affected. There were earlier rumblings of trouble: 56 crashes in 114.0.1, and then a single crash each in 112.0.1 and 112.0.2. Zero crashes with that "reason" on any other version! That doesn't comport well with the problem being (only) VCRUNTIME140.dll, because presumably some of the people with 114.0.2 or 115.0.1 etc. also had those same vcruntime versions. It could be corruption of the executable memory (as opposed to the heap in the access-violation case above), but why would that be so version-specific or tied to VCRUNTIME140.dll versions? It could be corruption of the build itself, but then you'd expect lots and lots more people to have trouble here.
The 3rd most common reason (13%) is SIGSEGV / SEGV_ACCERR. In theory this is the Linux version of the Windows EXCEPTION_ACCESS_VIOLATION, but this one is primarily a problem in 113, with only a small number (22) of 115 crashes.
And another odd one is EXCEPTION_SINGLE_STEP. That should normally only happen when you're debugging a program, to facilitate single-stepping through the code. If something sets the trap flag and you're not in a debugger, then you get this. No program is going to set that on purpose, but if our instruction pointer got off somehow (like it appears to in the PRIV_INSTRUCTION case) then it might be possible to wrongly interpret something as an ICE instruction. Or maybe, if we're setting other flags, these people had the bad luck of a bit flip in the trap bit. Maybe? But then why so version-specific? 111 had 34% of these, and they steadily decline, with 115 having 7.5% of them. The numbers are smaller, but it's a really odd one that seems like corruption of something. This reason is overwhelmingly 32-bit (97%).
Maybe there will be clues by looking at the kinds of things that are being freed higher up the stack. I saw a wide range of different things, but I didn't look at enough stacks to get a sense of whether there were clumps of different functionality that might be different bugs.
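For reference, the Windows crash reasons discussed in this comment correspond to the numeric exception codes quoted in the description. The lookup below is a small sketch: the three code values are the standard Windows ones, while the way the access type is appended to form a reason string is an assumption modeled on the reason names seen in this bug.

```python
EXCEPTION_CODES = {
    0xC0000005: "EXCEPTION_ACCESS_VIOLATION",
    0xC0000096: "EXCEPTION_PRIV_INSTRUCTION",
    0x80000004: "EXCEPTION_SINGLE_STEP",
}

# For access violations, the first exception parameter tells what kind of
# access faulted: 0 is a read, 1 is a write, 8 is an execute (DEP) fault.
ACCESS_TYPES = {0: "READ", 1: "WRITE", 8: "EXEC"}


def crash_reason(code, access_type=None):
    """Build a reason string from an exception code and optional access type."""
    name = EXCEPTION_CODES.get(code, f"UNKNOWN_{code:#010x}")
    if code == 0xC0000005 and access_type in ACCESS_TYPES:
        return f"{name}_{ACCESS_TYPES[access_type]}"
    return name


print(crash_reason(0xC0000005, 1))
print(crash_reason(0xC0000096))
print(crash_reason(0x80000004))
```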
Comment 3•2 years ago
The spike also seems to have started right around the July patch Tuesday. I'm thinking faulty Windows update maybe?
It's also the day we shipped 115.0.2, the version that had a spike of crashes. A smaller echo spike happened with 115.0.3esr which presumably had the same code.
Comment 5•2 years ago (Reporter)
(In reply to Daniel Veditz [:dveditz] from comment #2)
That signature ramps up over time starting in 111 and would be concerning all on its own.
Note: The attached signature didn't exist before 111 because bug 1808429 introduced MaybePoison on the 111 branch, to replace what were previously direct calls to memset. If these crashes already existed in 110 and earlier, they would be stored under a different signature.
Comment 6•2 years ago (Reporter)
(In reply to Daniel Veditz [:dveditz] from comment #2)
Although the above signature is worst in 115.0.2, the big visual spike in this crash signature is because EXCEPTION_PRIV_INSTRUCTION gets added to the mix. That's only 15% of the crashes overall, but 82% of those crashes (264) happen in 115.0.2 specifically, the only 115 version that's affected. There were earlier rumblings of trouble: 56 crashes in 114.0.1, and then a single crash each in 112.0.1 and 112.0.2. Zero crashes with that "reason" on any other version!
All these 328 EXCEPTION_PRIV_INSTRUCTION crashes are with family 23 model 1 stepping 1 CPU info, so this portion of the crashes should definitely be a CPU bug. Also interesting is that even though the signature exists for 115.0.3esr, we have no instance of EXCEPTION_PRIV_INSTRUCTION there, despite the source code being mostly the same as 115.0.2. I agree that these additional considerations make it unlikely that this is a problem with vcruntime140.dll alone after all.
If we do the reverse search and look at which versions family 23 model 1 stepping 1 CPUs have crashed on, we get the following top 5:
1 115.0.2 396 79.36 %
2 114.0.1 71 14.23 %
3 112.0.1 7 1.40 %
4 111.0.1 5 1.00 %
5 114.0.2 5 1.00 %
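The table gives counts and rounded percentages but not the total number of crashes for these CPUs. Assuming every percentage is a share of one common total rounded to two decimals, that total can be recovered by brute force, as in the sketch below. The function name and search bound are arbitrary choices for illustration.

```python
def consistent_totals(rows, max_total=100_000):
    """Return every integer total for which each count rounds to its percentage.

    rows is a list of (count, percent_rounded_to_two_decimals) pairs.
    """
    start = sum(count for count, _ in rows)  # the total cannot be smaller
    return [
        total
        for total in range(start, max_total + 1)
        if all(abs(100 * count / total - pct) < 0.005 + 1e-9
               for count, pct in rows)
    ]


# Counts and percentages copied from the top 5 table above.
TOP5 = [(396, 79.36), (71, 14.23), (7, 1.40), (5, 1.00), (5, 1.00)]
print(consistent_totals(TOP5))
```

Under that assumption the only consistent total is 499, which would leave 15 crashes spread over versions outside the top 5.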
The CPU bug is probably favored by bad luck with our builds of mozglue.dll for 114.0.1 and 115.0.2, and provoked by the interaction between these builds of mozglue.dll and some versions of vcruntime140.dll. If this occurs again with the 116 release, a rebuild could do the trick, like it seems to have (unintentionally) worked for 114.0.2.
Then, the rest of the volume could indeed have a different root cause than the CPU bug, as you mention.
Comment 7•2 years ago
The severity field is not set for this bug.
:glandium, could you have a look please?
For more information, please visit BugBot documentation.