Crash in [@ sqlite3AddGenerated]
Categories
(Core :: Networking, defect)
Tracking
Release | Tracking | Status |
---|---|---|
firefox-esr91 | --- | unaffected |
firefox95 | + | fixed |
firefox96 | --- | unaffected |
firefox97 | --- | unaffected |
People
(Reporter: gsvelto, Unassigned)
Details
(Keywords: crash, csectype-wildptr, sec-high)
Crash Data
Crash report: https://crash-stats.mozilla.org/report/index/7474b8e0-5f44-4cce-8953-3ffc20211218
Reason: EXCEPTION_ACCESS_VIOLATION_WRITE
Top 5 frames of crashing thread:
0 nss3.dll sqlite3AddGenerated third_party/sqlite3/src/sqlite3.c:114246
1 None @0x000007fdffffffff
2 nss3.dll ssl_SecureRecv security/nss/lib/ssl/sslsecur.c:803
3 nss3.dll ssl_Recv security/nss/lib/ssl/sslsock.c:3174
4 xul.dll PSMRecv security/manager/ssl/nsNSSIOLayer.cpp:1309
This crash makes no sense; I suspect the stack is corrupted. I'm filing it under networking because the first frame (sqlite) seems very unlikely to be correct, the upper frames are in SSL code, and it's happening on a socket thread. Some notes:
- This is happening only on 95.0.1
- This seems to be happening only on 64-bit builds on Windows 7, 8 and 8.1. We have no 32-bit crashes and no Windows 10 or 11 crashes, so it seems to be both architecture- and OS-specific
- All the crashes have the 0x000007fdffffffff address on the stack, which makes me think the first frame is bogus
- This is spiking really fast; in the comments many users complain of having suffered several crashes, seemingly at random
Comment 1•4 years ago
This looks to be a sec-high due to writes to non-NULL pointer addresses. As gsvelto says, this appears to be impacting many users repeatedly.
Ryan, Dana - It looks like this started happening when 95.0.1 hit Release. Could this be due to the fix that landed for https://bugzilla.mozilla.org/show_bug.cgi?id=1745600 ? That fix landed in https://bugzilla.mozilla.org/show_bug.cgi?id=966856 .
Needinfo'ing dveditz as a heads up so he can keep this on his radar.
Comment 2•4 years ago
[Tracking Requested - why for this release]:
This seems to be a release regression.
(Please keep in mind that, AFAIK, we are processing only 10% of the crashes.)
Comment 3•4 years ago
Adding Pascal & Dianna (the release owner)
Comment 4•4 years ago
Here is the changelog for the dot release:
https://hg.mozilla.org/releases/mozilla-release/pushloghtml?fromchange=FIREFOX_95_0_RELEASE&tochange=FIREFOX_95_0_1_RELEASE&full&version=2
Comment 5•4 years ago
Pascal, what about disabling updates?
Comment 6•4 years ago
(In reply to Sylvestre Ledru [:Sylvestre] from comment #5)
> Pascal, what about disabling updates?

This is done. 46% of our users are on 95.0.1. Note, though, that people staying on a version lower than 95.0.1 don't have access to Microsoft sites.
Comment 7•4 years ago
All the crashes are on AMD family 20 CPUs - Bobcat. That's a family which has been known to cause random crash spikes in the past.
Comment 8•4 years ago
Just for good measure, I did have a look at the patch from bug 966856, and I really couldn't see how it could be causing memory safety issues (much less issues on the socket thread, whereas that code runs on the certificate verification threads).
Comment 9•4 years ago
I believe both the graphics and JS teams have investigated Bobcat crashes in the past. Jeff or Jan, do either of your teams have any test machines handy that could be used for trying to reproduce the crashes on 95.0.1 and (hopefully) the lack of crashes in 95.0.2 being built now?
Comment 10•4 years ago
The first stack frame is actually the only reliable piece of information we have. The problem is that the instruction pointer is in the middle of an instruction:
(...)
18009a229: 0f 28 b4 24 e0 00 00 movaps xmm6,XMMWORD PTR [rsp+0xe0]
18009a230: 00
18009a231: 0f 28 bc 24 f0 00 00 movaps xmm7,XMMWORD PTR [rsp+0xf0]
18009a238: 00
18009a239: 44 0f 28 84 24 00 01 movaps xmm8,XMMWORD PTR [rsp+0x100]
-> 18009a240: 00 00
18009a242: 44 0f 28 8c 24 10 01 movaps xmm9,XMMWORD PTR [rsp+0x110]
18009a249: 00 00
18009a24b: 44 0f 28 94 24 20 01 movaps xmm10,XMMWORD PTR [rsp+0x120]
18009a252: 00 00
18009a254: 48 81 c4 38 01 00 00 add rsp,0x138
18009a25b: 5b pop rbx
18009a25c: 5d pop rbp
18009a25d: 5f pop rdi
18009a25e: 5e pop rsi
18009a25f: 41 5c pop r12
18009a261: 41 5d pop r13
18009a263: 41 5e pop r14
18009a265: 41 5f pop r15
The `->` is where we are. How we get there would be the interesting question...
That instruction is seen as `add byte ptr [rax], al`.
Comment 11•4 years ago
There is no point in the previous instructions where we'd have a valid instruction that finishes at that address, so we have to have jumped there directly, but no register contains the address.
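The observation in comments 10 and 11 can be checked mechanically. The following is a minimal sketch, not the tooling anyone on the bug used: it copies the instruction start addresses and bytes from the listing in comment 10, confirms that the crashing instruction pointer is not an instruction boundary, and decodes the two bytes at that address. The helper names (`containing_instruction`, `is_boundary`, `decode_add_rm8_r8`) are made up for illustration, and the decoder deliberately handles only the single `ADD r/m8, r8` form needed here.

```python
# Instruction start addresses and encoded bytes, copied from the listing in comment 10.
LISTING = [
    (0x18009A229, "0f 28 b4 24 e0 00 00 00"),     # movaps xmm6, [rsp+0xe0]
    (0x18009A231, "0f 28 bc 24 f0 00 00 00"),     # movaps xmm7, [rsp+0xf0]
    (0x18009A239, "44 0f 28 84 24 00 01 00 00"),  # movaps xmm8, [rsp+0x100]
    (0x18009A242, "44 0f 28 8c 24 10 01 00 00"),  # movaps xmm9, [rsp+0x110]
    (0x18009A24B, "44 0f 28 94 24 20 01 00 00"),  # movaps xmm10, [rsp+0x120]
    (0x18009A254, "48 81 c4 38 01 00 00"),        # add rsp, 0x138
]
CRASH_IP = 0x18009A240  # the address marked with "->"

REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG64 = ["rax", "rcx", "rdx", "rbx", None, None, "rsi", "rdi"]


def parse(listing):
    """Turn (address, hex string) pairs into (address, bytes) pairs."""
    return [(addr, bytes.fromhex(hexstr)) for addr, hexstr in listing]


def is_boundary(instructions, ip):
    """True if ip is the first byte of one of the listed instructions."""
    return ip in {addr for addr, _ in instructions}


def containing_instruction(instructions, ip):
    """Return (start, bytes) of the listed instruction that covers ip, or None."""
    for addr, code in instructions:
        if addr <= ip < addr + len(code):
            return addr, code
    return None


def decode_add_rm8_r8(code):
    """Decode opcode 0x00 (ADD r/m8, r8) with a plain [reg] operand only.

    No prefixes, SIB bytes or displacements are handled; anything else
    returns None. This is an illustration, not a general disassembler.
    """
    if len(code) < 2 or code[0] != 0x00:
        return None
    modrm = code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0 or rm in (4, 5):  # SIB and RIP-relative forms not handled
        return None
    return f"add byte ptr [{REG64[rm]}], {REG8[reg]}"


instructions = parse(LISTING)
start, code = containing_instruction(instructions, CRASH_IP)
offset = CRASH_IP - start
decoded = decode_add_rm8_r8(code[offset:])

print(f"boundary: {is_boundary(instructions, CRASH_IP)}")
print(f"inside instruction at {start:#x}, offset {offset}")
print(f"bytes at crash IP decode as: {decoded}")
```

With the bytes from the listing, the crash address falls 7 bytes into the 9-byte `movaps xmm8` instruction, and the trailing `00 00` of its displacement decodes as `add byte ptr [rax], al`, matching what the crash report shows.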
Comment 12•4 years ago
These are all on Windows versions 7, 8 and 8.1, with no Windows 10, and on a narrow range of AMD CPUs (Family 20, models 1 and 2, 5 different microcode versions).
Moving to a more generally accessible security group since it's not clear where the problem actually is. Could it be build gremlins?
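For readers matching "Family 20" here against the `Family/0x14` path in the coreboot link later in the thread: decimal 20 is hexadecimal 0x14. The sketch below shows how a family and model are derived from a CPUID leaf 1 EAX signature under the AMD convention (extended fields are added in when the base family is 0xF). The two signature values are illustrative examples chosen to decode to family 0x14, models 1 and 2; they are not taken from the crash reports, and `matches_comment_12` is a hypothetical helper name.

```python
def decode_cpuid_signature(eax):
    """Split a CPUID leaf 1 EAX value into (family, model, stepping).

    Follows the AMD convention: when the base family is 0xF, the extended
    family is added to it and the extended model supplies the high nibble
    of the model.
    """
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    if base_family == 0xF:
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family = base_family
        model = base_model
    return family, model, stepping


def matches_comment_12(family, model):
    """True for the CPU range named in comment 12: family 20 (0x14), models 1 and 2."""
    return family == 0x14 and model in (1, 2)


# Illustrative signatures only (assumed values, not from crash data).
for signature in (0x00500F10, 0x00500F20):
    family, model, stepping = decode_cpuid_signature(signature)
    print(f"{signature:#010x}: family {family} ({family:#x}), model {model}, "
          f"stepping {stepping}, in range: {matches_comment_12(family, model)}")
```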
Comment 13•4 years ago
This does fit bug 772330 comment 55.
Comment 14•4 years ago (Reporter)
I just realized why we're not seeing Windows 10 crashes: the errata cited in bug 772330 comment 55 mentions that an update was planned; if it was done, it was presumably shipped as a microcode update. Microsoft started shipping microcode updates automatically in Windows 10, but never shipped them on prior versions.
Comment 15•4 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #9)
> I believe both the graphics and JS teams have investigated Bobcat crashes in the past. Jeff or Jan, do either of your teams have any test machines handy that could be used for trying to reproduce the crashes on 95.0.1 and (hopefully) the lack of crashes in 95.0.2 being built now?
The JS team had some Bobcat machines. According to Ted they're now in the Toronto office, forwarding NI to him...
Comment 16•4 years ago
I added two Bobcat-based laptops to the GFX test laptop stash in the Toronto office. The JS team's previous investigation was Bug 1281759, but we removed the sensitive code when we added the Warp JIT, so we closed the issue there.
Comment 17•4 years ago
In agreement with comments 10 and 11, the general behaviour of the AMD Bobcat bug is that CPU branch defects end up jumping to incorrect addresses and then executing bogus sequences of instructions, which causes the crash. Generally, all our attempts to dodge this behaviour in the JITs had no effect, and eventually we removed the unlucky subsystem entirely for other reasons. Using test devices, I was only able to reproduce the crash once or twice, and it did not end up leading anywhere.
When I looked into this about four years ago, the estimate from Data Science was that there were in the ballpark of 200k users on these devices.
Comment 18•4 years ago
Fixed by the 95.0.2 rebuild. Opening the bug as well, since this isn't an actionable security issue with Firefox.
Comment 19•4 years ago
(In reply to Gabriele Svelto [:gsvelto] from comment #14)
> I just realized why we're not seeing Windows 10 crashes: the errata cited in bug 772330 comment 55 mentions that an update was planned, if it was done it was presumably shipped as a microcode update. Microsoft started shipping microcode updates automatically in Windows 10, but never shipped them on prior versions.
The update came with newer versions of AGESA.
https://github.com/coreboot/coreboot/blob/master/src/vendorcode/amd/agesa/f14/Proc/CPU/Family/0x14/ON/F14OnInitEarlyTable.c#L295
Even if Microsoft had microcode patching since Vista (and 7 and 8 also had an update explicitly for these CPUs, as I mentioned in bug 772330 comment 60), that couldn't have changed anything.
I suppose there's even the remote chance that they are manually applying the MSR fix themselves in W10 (just like Linux 4.14+)... But my very uneducated guess is that if there was no crash there, it's just that some of the Spectre/Meltdown/other mitigations affect the stack enough that the bug doesn't trigger.