1746733 - Crash in [@ sqlite3AddGenerated]

Gabriele Svelto [:gsvelto]

Reporter

Description

•

4 years ago

Crash report: https://crash-stats.mozilla.org/report/index/7474b8e0-5f44-4cce-8953-3ffc20211218

Reason: EXCEPTION_ACCESS_VIOLATION_WRITE

Top 5 frames of crashing thread:

0 nss3.dll sqlite3AddGenerated third_party/sqlite3/src/sqlite3.c:114246
1 None @0x000007fdffffffff 
2 nss3.dll ssl_SecureRecv security/nss/lib/ssl/sslsecur.c:803
3 nss3.dll ssl_Recv security/nss/lib/ssl/sslsock.c:3174
4 xul.dll PSMRecv security/manager/ssl/nsNSSIOLayer.cpp:1309

This crash makes no sense; I suspect the stack is corrupted. I'm filing it under networking because the first frame (sqlite) seems very unlikely to be correct, the upper frames are in SSL code, and it's happening on a socket thread. Some notes:

This is happening only on 95.0.1
This seems to be happening only on 64-bit builds on Windows 7, 8 and 8.1. We have no 32-bit crashes and no Windows 10 or 11 crashes, so it seems both arch- and OS-specific
All the crashes have the 0x000007fdffffffff address on the stack which makes me think the first frame is bogus
This is spiking really fast; in the comments many users complain of having suffered several crashes, seemingly at random

Maire Reavy [:mreavy]

Comment 1

•

4 years ago

This looks to be a sec-high due to writes to non-NULL pointer addresses. As gsvelto says, this appears to be impacting many users repeatedly.

Ryan, Dana - It looks like this started happening when 95.0.1 hit Release. Could this be due to the fix that landed for https://bugzilla.mozilla.org/show_bug.cgi?id=1745600 ? The fix for it landed in https://bugzilla.mozilla.org/show_bug.cgi?id=966856 .

Needinfo'ing dveditz as a heads up so he can keep this on his radar.

Group: core-security

Flags: needinfo?(ryanvm)

Flags: needinfo?(dveditz)

Flags: needinfo?(dkeeler)

Keywords: csectype-wildptr, sec-high

Sylvestre Ledru [:Sylvestre]

Comment 2

•

4 years ago

[Tracking Requested - why for this release]:
This seems to be a release regression
(please keep in mind that, AFAIK, we are processing only 10% of the crashes)

status-firefox95: --- → affected

tracking-firefox95: --- → ?

Flags: needinfo?(pascalc)

Flags: needinfo?(dsmith)

Sylvestre Ledru [:Sylvestre]

Comment 3

•

4 years ago

Adding Pascal & Dianna (the release owner)

Sylvestre Ledru [:Sylvestre]

Updated

•

4 years ago

Flags: needinfo?(bbeurdouche)

Pascal Chevrel:pascalc (PTO until Sept 2)

Comment 4

•

4 years ago

Here is the changelog for the dot release:
https://hg.mozilla.org/releases/mozilla-release/pushloghtml?fromchange=FIREFOX_95_0_RELEASE&tochange=FIREFOX_95_0_1_RELEASE&full&version=2

Flags: needinfo?(pascalc)

Sylvestre Ledru [:Sylvestre]

Comment 5

•

4 years ago

Pascal, what about disabling updates?

Pascal Chevrel:pascalc (PTO until Sept 2)

Updated

•

4 years ago

tracking-firefox95: ? → +

Pascal Chevrel:pascalc (PTO until Sept 2)

Comment 6

•

4 years ago

(In reply to Sylvestre Ledru [:Sylvestre] from comment #5)

Pascal, what about disabling updates?

This is done. 46% of our users are on 95.0.1. Note, though, that people staying on a version lower than 95.0.1 don't have access to Microsoft sites.

Ryan VanderMeulen [:RyanVM]

Comment 7

•

4 years ago

All the crashes are on AMD family 20 CPUs - Bobcat. That's a family which has been known to cause random crash spikes in the past.

Flags: needinfo?(ryanvm)

Dana Keeler (she/her) [:keeler]

Comment 8

•

4 years ago

Just for good measure, I did have a look at the patch from bug 966856, and I really couldn't see how it could be causing memory safety issues (much less issues on the socket thread, whereas that code runs on the certificate verification threads).

Flags: needinfo?(dkeeler)

Ryan VanderMeulen [:RyanVM]

Comment 9

•

4 years ago

I believe both the graphics and JS teams have investigated Bobcat crashes in the past. Jeff or Jan, do either of your teams have any test machines handy that could be used for trying to reproduce the crashes on 95.0.1 and (hopefully) the lack of crashes in 95.0.2 being built now?

Flags: needinfo?(jmuizelaar)

Flags: needinfo?(jdemooij)

Mike Hommey [:glandium]

Comment 10

•

4 years ago

•

Edited

The first stack frame is actually the only reliable piece of information we have. The problem is that the instruction pointer is in the middle of an instruction:

(...)
   18009a229:   0f 28 b4 24 e0 00 00    movaps xmm6,XMMWORD PTR [rsp+0xe0]
   18009a230:   00 
   18009a231:   0f 28 bc 24 f0 00 00    movaps xmm7,XMMWORD PTR [rsp+0xf0]
   18009a238:   00 
   18009a239:   44 0f 28 84 24 00 01    movaps xmm8,XMMWORD PTR [rsp+0x100]
-> 18009a240:   00 00 
   18009a242:   44 0f 28 8c 24 10 01    movaps xmm9,XMMWORD PTR [rsp+0x110]
   18009a249:   00 00 
   18009a24b:   44 0f 28 94 24 20 01    movaps xmm10,XMMWORD PTR [rsp+0x120]
   18009a252:   00 00 
   18009a254:   48 81 c4 38 01 00 00    add    rsp,0x138
   18009a25b:   5b                      pop    rbx
   18009a25c:   5d                      pop    rbp
   18009a25d:   5f                      pop    rdi
   18009a25e:   5e                      pop    rsi
   18009a25f:   41 5c                   pop    r12
   18009a261:   41 5d                   pop    r13
   18009a263:   41 5e                   pop    r14
   18009a265:   41 5f                   pop    r15

The -> is where we are. How we get there would be the interesting question...
That instruction is seen as add byte ptr [rax], al.

Mike Hommey [:glandium]

Comment 11

•

4 years ago

There is no point in the previous instructions where we'd have a valid instruction that finishes at that address, so we have to have jumped there directly, but no register contains the address.
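This can be checked against the listing in comment 10. The Python sketch below copies the start addresses and byte counts from that disassembly and looks for the instruction that contains the faulting address. The names INSTRUCTIONS, CRASH_IP and containing_instruction are made up for illustration.

```python
# (start address, length in bytes) for the instructions around the crash,
# copied from the disassembly in comment 10.
INSTRUCTIONS = [
    (0x18009A229, 8),  # movaps xmm6, [rsp+0xe0]
    (0x18009A231, 8),  # movaps xmm7, [rsp+0xf0]
    (0x18009A239, 9),  # movaps xmm8, [rsp+0x100]
    (0x18009A242, 9),  # movaps xmm9, [rsp+0x110]
    (0x18009A24B, 9),  # movaps xmm10, [rsp+0x120]
    (0x18009A254, 7),  # add rsp, 0x138
]

CRASH_IP = 0x18009A240  # the address marked with -> in the listing


def containing_instruction(ip):
    """Return (start, offset) if ip lies strictly inside an instruction, else None."""
    for start, length in INSTRUCTIONS:
        if start < ip < start + length:
            return start, ip - start
    return None


boundaries = {start for start, _ in INSTRUCTIONS}
print(CRASH_IP in boundaries)            # is the crash address an instruction start?
print(containing_instruction(CRASH_IP))  # which instruction it falls inside, if any
```

The crash address is 7 bytes into the 9-byte movaps xmm8 instruction, which leaves exactly the trailing 00 00 of its displacement to be executed.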

Daniel Veditz [:dveditz]

Comment 12

•

4 years ago

These are all on Windows versions 7, 8, and 8.1; no Windows 10. They cover a narrow range of AMD CPUs (family 20, models 1 and 2, with 5 different microcode versions).
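As a rough sketch of how this population could be picked out of crash data, the Python snippet below tests one report against the pattern described here. The CPU string format ("family 20 model 2 stepping 0") and the OS version strings are assumptions about what a crash report exposes. The function name matches_signature and the sample inputs are hypothetical.

```python
import re

# Assumed format of the CPU description string; adjust the pattern if the
# real field differs.
CPU_RE = re.compile(r"family (\d+) model (\d+)")

BOBCAT_FAMILY = 20                         # AMD family 20 (0x14), per comment 7
AFFECTED_MODELS = {1, 2}                   # models reported in this comment
AFFECTED_WINDOWS = {"6.1", "6.2", "6.3"}   # NT versions of Windows 7, 8, 8.1


def matches_signature(cpu_info: str, os_version: str) -> bool:
    """True if a report fits the CPU and Windows pattern described in this bug."""
    m = CPU_RE.search(cpu_info)
    if m is None:
        return False
    family, model = int(m.group(1)), int(m.group(2))
    major_minor = ".".join(os_version.split(".")[:2])
    return (
        family == BOBCAT_FAMILY
        and model in AFFECTED_MODELS
        and major_minor in AFFECTED_WINDOWS
    )


# Made-up sample inputs, for illustration only.
print(matches_signature("family 20 model 2 stepping 0", "6.1.7601 Service Pack 1"))
print(matches_signature("family 20 model 1 stepping 0", "10.0.19044"))
```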

Moving to a more generally accessible security group since it's not clear where the problem actually is. Could it be build gremlins?

Group: core-security → core-security-release

Flags: needinfo?(dveditz)

Mike Hommey [:glandium]

Comment 13

•

4 years ago

This does fit bug 772330 comment 55.

Daniel Veditz [:dveditz]

Updated

•

4 years ago

Gabriele Svelto [:gsvelto]

Comment 14

•

4 years ago

•

Edited

I just realized why we're not seeing Windows 10 crashes: the errata cited in bug 772330 comment 55 mentions that an update was planned; if it was done, it was presumably shipped as a microcode update. Microsoft started shipping microcode updates automatically in Windows 10, but never shipped them on prior versions.

Jan de Mooij [:jandem]

Comment 15

•

4 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #9)

I believe both the graphics and JS teams have investigated Bobcat crashes in the past. Jeff or Jan, do either of your teams have any test machines handy that could be used for trying to reproduce the crashes on 95.0.1 and (hopefully) the lack of crashes in 95.0.2 being built now?

The JS team had some Bobcat machines. According to Ted they're now in the Toronto office, forwarding NI to him...

Flags: needinfo?(jdemooij) → needinfo?(tcampbell)

Ted Campbell [:tcampbell]

Comment 16

•

4 years ago

I added two Bobcat-based laptops to the GFX test laptop stash in the Toronto office. The JS team's previous investigation was bug 1281759, but we removed the sensitive code when we added the Warp JIT, so we closed the issue there.

Ted Campbell [:tcampbell]

Comment 17

•

4 years ago

In agreement with comments 10 and 11, the general behaviour of the AMD Bobcat bug is that CPU branch defects end up jumping to incorrect addresses and then executing bogus sequences of instructions, which causes the crash. Generally, all our attempts to dodge this behaviour in the JITs had no effect, and eventually we removed the unlucky subsystem entirely for other reasons. Using test devices, I was only able to reproduce it once or twice, and it did not end up leading anywhere.

When I looked into this about four years ago, the estimate from Data Science was that there were in the ballpark of 200k users on these devices.

Sylvestre Ledru [:Sylvestre]

Updated

•

4 years ago

Flags: needinfo?(dsmith)

Flags: needinfo?(bbeurdouche)

Chris Peterson [:cpeterson]

Updated

•

4 years ago

status-firefox96: --- → unaffected

status-firefox97: --- → unaffected

status-firefox-esr91: --- → unaffected

OS: Windows → Windows 7

Ryan VanderMeulen [:RyanVM]

Comment 18

•

4 years ago

Fixed by the 95.0.2 rebuild. Opening the bug as well, as this isn't an actionable security issue with Firefox.

Group: core-security-release

Status: NEW → RESOLVED

Closed: 4 years ago

status-firefox95: affected → fixed

Flags: needinfo?(tcampbell)

Flags: needinfo?(jmuizelaar)

Resolution: --- → FIXED

mirh

Comment 19

•

4 years ago

(In reply to Gabriele Svelto [:gsvelto] from comment #14)

I just realized why we're not seeing Windows 10 crashes: the errata cited in bug 772330 comment 55 mentions that an update was planned, if it was done it was presumably shipped as a microcode update. Microsoft started shipping microcode updates automatically in Windows 10, but never shipped them on prior versions.

The update came with newer versions of AGESA.
https://github.com/coreboot/coreboot/blob/master/src/vendorcode/amd/agesa/f14/Proc/CPU/Family/0x14/ON/F14OnInitEarlyTable.c#L295
Even if Microsoft had microcode patching since Vista (and 7 and 8 also had an update explicitly for these CPUs, as I mentioned in bug 772330 comment 60), that couldn't have changed anything.

I suppose there's even the remote chance that they may be manually applying the MSR fix themselves in W10 (just like Linux 4.14+)... But my very uneducated guess is that if there was no crash there, it's just some of the Spectre/Meltdown/other mitigations affecting the stack enough that the bug doesn't trigger.

Categories: Core :: Networking, defect

People: Reporter: gsvelto, Unassigned

Keywords: crash, csectype-wildptr, sec-high

Security: public
