Closed Bug 1746733 Opened 4 years ago Closed 4 years ago

Crash in [@ sqlite3AddGenerated]

Categories

(Core :: Networking, defect)

Unspecified
Windows 7
defect

Tracking

()

RESOLVED FIXED
Tracking Status
firefox-esr91 --- unaffected
firefox95 + fixed
firefox96 --- unaffected
firefox97 --- unaffected

People

(Reporter: gsvelto, Unassigned)

References

Details

(Keywords: crash, csectype-wildptr, sec-high)

Crash Data

Crash report: https://crash-stats.mozilla.org/report/index/7474b8e0-5f44-4cce-8953-3ffc20211218

Reason: EXCEPTION_ACCESS_VIOLATION_WRITE

Top 5 frames of crashing thread:

0 nss3.dll sqlite3AddGenerated third_party/sqlite3/src/sqlite3.c:114246
1 None @0x000007fdffffffff 
2 nss3.dll ssl_SecureRecv security/nss/lib/ssl/sslsecur.c:803
3 nss3.dll ssl_Recv security/nss/lib/ssl/sslsock.c:3174
4 xul.dll PSMRecv security/manager/ssl/nsNSSIOLayer.cpp:1309

This crash makes no sense; I suspect the stack is corrupted. I'm filing it under networking because the first frame (sqlite) seems very unlikely to be correct, the upper frames are in SSL code, and it's happening on a socket thread. Some notes:

  • This is happening only on 95.0.1
  • This seems to be happening only on 64-bit builds on Windows 7, 8 and 8.1. We have no 32-bit crashes and no Windows 10 or 11 crashes, so it seems to be both arch- and OS-specific
  • All the crashes have the 0x000007fdffffffff address on the stack which makes me think the first frame is bogus
  • This is spiking really fast; in the comments many users complain of having suffered several crashes, seemingly at random

This looks to be a sec-high due to writes to non-NULL pointer addresses. As gsvelto says, this appears to be impacting many users repeatedly.

Ryan, Dana - It looks like this started happening when 95.0.1 hit Release. Could this be due to the fix that landed for https://bugzilla.mozilla.org/show_bug.cgi?id=1745600 ? That fix landed in https://bugzilla.mozilla.org/show_bug.cgi?id=966856 .

Needinfo'ing dveditz as a heads up so he can keep this on his radar.

Group: core-security
Flags: needinfo?(ryanvm)
Flags: needinfo?(dveditz)
Flags: needinfo?(dkeeler)

[Tracking Requested - why for this release]:
This seems to be a release regression
(please keep in mind that, AFAIK, we are processing only 10% of the crashes)

Flags: needinfo?(pascalc)
Flags: needinfo?(dsmith)

Adding Pascal & Dianna (the release owner)

Flags: needinfo?(bbeurdouche)

Pascal, what about disabling updates?

(In reply to Sylvestre Ledru [:Sylvestre] from comment #5)

Pascal, what about disabling updates?

This is done. 46% of our users are on 95.0.1. Note, though, that people staying on a version lower than 95.0.1 don't have access to Microsoft sites.

All the crashes are on AMD family 20 CPUs - Bobcat. That's a family which has been known to cause random crash spikes in the past.

Flags: needinfo?(ryanvm)

Just for good measure, I did have a look at the patch from bug 966856, and I really couldn't see how it could be causing memory safety issues (much less issues on the socket thread, whereas that code runs on the certificate verification threads).

Flags: needinfo?(dkeeler)

I believe both the graphics and JS teams have investigated Bobcat crashes in the past. Jeff or Jan, do either of your teams have any test machines handy that could be used for trying to reproduce the crashes on 95.0.1 and (hopefully) the lack of crashes in 95.0.2 being built now?

Flags: needinfo?(jmuizelaar)
Flags: needinfo?(jdemooij)

The first stack frame is actually the only reliable piece of information we have. The problem is that the instruction pointer is in the middle of an instruction:

(...)
   18009a229:   0f 28 b4 24 e0 00 00    movaps xmm6,XMMWORD PTR [rsp+0xe0]
   18009a230:   00 
   18009a231:   0f 28 bc 24 f0 00 00    movaps xmm7,XMMWORD PTR [rsp+0xf0]
   18009a238:   00 
   18009a239:   44 0f 28 84 24 00 01    movaps xmm8,XMMWORD PTR [rsp+0x100]
-> 18009a240:   00 00 
   18009a242:   44 0f 28 8c 24 10 01    movaps xmm9,XMMWORD PTR [rsp+0x110]
   18009a249:   00 00 
   18009a24b:   44 0f 28 94 24 20 01    movaps xmm10,XMMWORD PTR [rsp+0x120]
   18009a252:   00 00 
   18009a254:   48 81 c4 38 01 00 00    add    rsp,0x138
   18009a25b:   5b                      pop    rbx
   18009a25c:   5d                      pop    rbp
   18009a25d:   5f                      pop    rdi
   18009a25e:   5e                      pop    rsi
   18009a25f:   41 5c                   pop    r12
   18009a261:   41 5d                   pop    r13
   18009a263:   41 5e                   pop    r14
   18009a265:   41 5f                   pop    r15

The -> is where we are. How we get there would be the interesting question...
That instruction is seen as add byte ptr [rax], al.

There is no point in the previous instructions where we'd have a valid instruction that finishes at that address, so we have to have jumped there directly, but no register contains the address.
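To make the boundary argument concrete, here is a minimal sketch that checks where the crashing instruction pointer falls relative to the instruction starts in the listing above. The addresses and bytes are copied from that listing; the helper name `locate` is made up for illustration. It only checks the compiler-emitted instruction stream, not every possible misaligned decoding.

```python
# Instruction start addresses and encodings copied from the listing above.
LISTING = [
    (0x18009A229, bytes.fromhex("0f28b424e0000000")),    # movaps xmm6, [rsp+0xe0]
    (0x18009A231, bytes.fromhex("0f28bc24f0000000")),    # movaps xmm7, [rsp+0xf0]
    (0x18009A239, bytes.fromhex("440f28842400010000")),  # movaps xmm8, [rsp+0x100]
    (0x18009A242, bytes.fromhex("440f288c2410010000")),  # movaps xmm9, [rsp+0x110]
    (0x18009A24B, bytes.fromhex("440f28942420010000")),  # movaps xmm10, [rsp+0x120]
    (0x18009A254, bytes.fromhex("4881c438010000")),      # add rsp, 0x138
]

CRASH_IP = 0x18009A240  # the address marked with "->"


def locate(ip, listing):
    """Return (start, offset, remaining bytes) for the instruction containing ip.

    Hypothetical helper: returns None if ip is outside the listed range.
    """
    for start, encoding in listing:
        if start <= ip < start + len(encoding):
            offset = ip - start
            return start, offset, encoding[offset:]
    return None


start, offset, tail = locate(CRASH_IP, LISTING)
on_boundary = offset == 0

print(f"ip {CRASH_IP:#x} is {offset} bytes into the instruction at {start:#x}")
print(f"on an instruction boundary: {on_boundary}")
print(f"bytes decoded from there: {tail.hex()}")
```

The instruction pointer lands 7 bytes into the `movaps xmm8` encoding, and the two bytes left over are `00 00`, which is what gets decoded as add byte ptr [rax], al.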

These are all on Windows versions 7, 8 and 8.1, with no Windows 10, and on a narrow range of AMD CPUs (Family 20 models 1 and 2, 5 different microcode versions).
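The pattern described so far (64-bit, Windows 7/8/8.1, AMD family 20 models 1 and 2) can be sketched as a simple predicate over crash records. This is only an illustration: the dictionary field names (`cpu_arch`, `os_version`, `cpu_info`), the `cpu_info` string layout and the sample records are assumptions, not the actual crash-stats schema or real reports.

```python
import re

# Windows NT version numbers for Windows 7, 8 and 8.1.
AFFECTED_WINDOWS = {"6.1", "6.2", "6.3"}
AFFECTED_FAMILY = 20
AFFECTED_MODELS = {1, 2}

# Assumed layout of the CPU description string, e.g. "family 20 model 2 stepping 0".
CPU_RE = re.compile(r"family (\d+) model (\d+)")


def matches_bobcat_pattern(report):
    """Return True if a (hypothetical) crash record fits the pattern in this bug."""
    m = CPU_RE.search(report.get("cpu_info", ""))
    if m is None:
        return False
    family, model = int(m.group(1)), int(m.group(2))
    # Keep only "major.minor" from a version such as "6.1.7601".
    major_minor = ".".join(report.get("os_version", "").split(".")[:2])
    return (
        report.get("cpu_arch") == "amd64"
        and major_minor in AFFECTED_WINDOWS
        and family == AFFECTED_FAMILY
        and model in AFFECTED_MODELS
    )


# Made-up sample records, for illustration only.
win7_bobcat = {"cpu_arch": "amd64", "os_version": "6.1.7601",
               "cpu_info": "family 20 model 2 stepping 0"}
win10_bobcat = {"cpu_arch": "amd64", "os_version": "10.0.19043",
                "cpu_info": "family 20 model 2 stepping 0"}

print(matches_bobcat_pattern(win7_bobcat))
print(matches_bobcat_pattern(win10_bobcat))
```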

Moving to a more generally accessible security group since it's not clear where the problem actually is. Could it be build gremlins?

Group: core-security → core-security-release
Flags: needinfo?(dveditz)
See Also: → 772330

I just realized why we're not seeing Windows 10 crashes: the errata cited in bug 772330 comment 55 mentions that an update was planned, if it was done it was presumably shipped as a microcode update. Microsoft started shipping microcode updates automatically in Windows 10, but never shipped them on prior versions.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #9)

I believe both the graphics and JS teams have investigated Bobcat crashes in the past. Jeff or Jan, do either of your teams have any test machines handy that could be used for trying to reproduce the crashes on 95.0.1 and (hopefully) the lack of crashes in 95.0.2 being built now?

The JS team had some Bobcat machines. According to Ted they're now in the Toronto office, forwarding NI to him...

Flags: needinfo?(jdemooij) → needinfo?(tcampbell)

I added two Bobcat-based laptops to the GFX test laptop stash in the Toronto office. The JS team's previous investigation was Bug 1281759, but we removed the sensitive code when we added the Warp JIT, so we closed the issue there.

In agreement with Comment 10 and 11, the general behaviour of the AMD Bobcat bug is that CPU branch defects end up jumping to incorrect addresses and then executing bogus sequences of instructions, which causes the crash. Generally, all our attempts to dodge this behaviour in the JITs had no effect, and eventually we removed the unlucky subsystem entirely for other reasons. Using test devices I was only able to reproduce once or twice, and it did not end up leading anywhere.

When I looked into this about four years ago, the estimate from Data Science was that there were in the ballpark of 200k users on these devices.

Flags: needinfo?(dsmith)
Flags: needinfo?(bbeurdouche)

Fixed by the 95.0.2 rebuild. Opening the bug as well, since this isn't an actionable security issue with Firefox.

Group: core-security-release
Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(tcampbell)
Flags: needinfo?(jmuizelaar)
Resolution: --- → FIXED

(In reply to Gabriele Svelto [:gsvelto] from comment #14)

I just realized why we're not seeing Windows 10 crashes: the errata cited in bug 772330 comment 55 mentions that an update was planned, if it was done it was presumably shipped as a microcode update. Microsoft started shipping microcode updates automatically in Windows 10, but never shipped them on prior versions.

The update came with newer versions of AGESA.
https://github.com/coreboot/coreboot/blob/master/src/vendorcode/amd/agesa/f14/Proc/CPU/Family/0x14/ON/F14OnInitEarlyTable.c#L295
Even if Microsoft has had microcode patching since Vista (and 7 and 8 also had an update explicitly for these CPUs, as I mentioned in bug 772330 comment 60), that couldn't have changed anything.
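For anyone matching up the numbers: the `0x14` in the coreboot path above is the same family that earlier comments call family 20 (decimal). Below is a minimal sketch of how the family and model are derived from the CPUID function 1 EAX value, following the AMD convention that the extended fields only apply when the base family is 0xF. The sample value is constructed for illustration so that it decodes to family 20 model 2; it is not taken from a crash report.

```python
def decode_cpuid_signature(eax):
    """Decode (family, model, stepping) from a CPUID function 1 EAX value."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF
    if base_family == 0xF:
        # Extended fields are only meaningful when the base family is 0xF.
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family = base_family
        model = base_model
    return family, model, stepping


# Illustrative value chosen to decode to family 20 (0x14), model 2.
SAMPLE_EAX = 0x00500F20

family, model, stepping = decode_cpuid_signature(SAMPLE_EAX)
print(f"family {family} ({family:#x}) model {model} stepping {stepping}")
```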

I suppose there's even the remote chance that they may be manually applying the MSR fix themselves in W10 (just like Linux 4.14+)... But my very uneducated guess is that if there was no crash there, it's just some of the Spectre/Meltdown/other mitigations affecting the stack enough that the bug doesn't trigger.
