Closed Bug 1788004 Opened 2 years ago Closed 1 year ago

startup Crash in [@ uuid::Uuid::new_v4]

Categories

(Toolkit :: Telemetry, defect)

x86
Windows 7
defect

Tracking

()

VERIFIED FIXED
112 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox-esr102 --- unaffected
firefox104 --- unaffected
firefox105 --- wontfix
firefox106 --- wontfix
firefox110 --- wontfix
firefox111 --- verified
firefox112 --- verified

People

(Reporter: aryx, Assigned: yannis)

References

Details

(Keywords: crash, Whiteboard: [win:stability])

Crash Data

Attachments

(1 file)

40 crash reports from 8+ installations, all on Windows 7 or 8.1 and all with 32-bit builds of 105 betas. >90% in the first minute after launch.

Could this be from the changes in bug 1781030? There hadn't been a crash during the Nightly development cycle or 104 betas.

Crash report: https://crash-stats.mozilla.org/report/index/3680107d-572f-4da2-80ca-b52b90220830

MOZ_CRASH Reason: could not retreive random bytes for uuid: OS Error: 1073741947

Top 10 frames of crashing thread:

0 xul.dll RustMozCrash mozglue/static/rust/wrappers.cpp:17
1 xul.dll mozglue_static::panic_hook mozglue/static/rust/lib.rs:91
2 xul.dll core::ops::function::Fn::call<void  ../4b91a6ea7258a947e59c6522cd5898e7c0a6a88f/library/core/src/ops/function.rs:248
3 xul.dll std::panicking::rust_panic_with_hook library/std/src/panicking.rs:702
4 xul.dll std::panicking::begin_panic_handler::closure$0 library/std/src/panicking.rs:588
5 xul.dll std::sys_common::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_handler::closure_env$0, never$> library/std/src/sys_common/backtrace.rs:138
6 xul.dll std::panicking::begin_panic_handler library/std/src/panicking.rs:584
7 xul.dll core::panicking::panic_fmt library/core/src/panicking.rs:142
8 xul.dll uuid::Uuid::new_v4 third_party/rust/uuid/src/v4.rs:38
9 xul.dll glean_core::metrics::ping::PingType::submit_sync third_party/rust/glean-core/src/metrics/ping.rs:141
Flags: needinfo?(jrediger)

bug 1781030 didn't change anything about how we use uuid::Uuid::new_v4, nor have we touched that particular crashing code path.
The uuid crate was updated from 0.8.1 to 0.8.2 in v103.
That uuid release changed from the rand crate to getrandom. That introduced that panic on a getrandom error.

That still doesn't explain why we're only seeing this now.
We could downgrade to 0.8.1 if this causes that many crashes.

Flags: needinfo?(jrediger)

Does that mean we should try to get a similar fallback into getrandom now or do we rather go the downgrade route for now?

So I tried to traceback how uuid uses rand and how rand then works.
rand also uses getrandom under the hood, so we at some point end up calling BCryptGenRandom, but probably not as often, because rand uses that just as a seed for its chacha implementation (if I understood it right).
I'm still not sure whether that means it might succeed the time we need it and further calls to create uuids will simply not call it.

I guess we have some options here for now:

  • We downgrade back to uuid 0.8.1, which goes the rand way. We haven't updated rand in a while, so this might "fix" it for now.
  • Or we implement the same fallback algorithm for getrandom is easy
    • but getrandom decided to explicitly only use BCryptGenRandom a while back, so I'm not sure how much leverage we have convincing them to take this fallback.
    • Of course we can use the patched version, but then we have all the downsides of a fork (missing out on updates when they happen upstream, ...).
  • we pull in the minimal things needed from uuid and getrandom into Glean and then can patch it up as we need it (essentially also a fork with the same issues, but given the smaller API maaaybe ok) not really an option, as it turns out we need quite a bit more than just v4 UUIDs.

Since the crash volume is low (less than 5 per week), the severity is downgraded to S3. Feel free to change it back if you think the bug is still critical.

For more information, please visit auto_nag documentation.

Severity: S2 → S3
Duplicate of this bug: 1804020
Crash Signature: [@ uuid::Uuid::new_v4] → [@ uuid::Uuid::new_v4] [@ uuid::v4::impl$0::new_v4::closure$0]

but getrandom decided to explicitly only use BCryptGenRandom a while back, so I'm not sure how much leverage we have convincing them to take this fallback.

AFAIK cmartin successfully argued them to reverse this in the stdlib. Note that one justification in that thread is "that's what Firefox does" and as you discovered here (and we also did in previous bugs), "what Firefox does" is actually broken on older Windows.

Flags: needinfo?(cmartin)

It's worth nothing that Rust StdLib eventually removed RtlGenRandom again after my original fix and went a different route with it. The getrandom crate has decided to do the same, and there is currently a WIP PR to fix this, but it's been in limbo for the last 5 weeks (presumably due to time constraints). Perhaps it would be a good idea to help the maintainers of that crate get this fixed soon?

Flags: needinfo?(cmartin)

Hello,

I did a bit of automated analysis on 71 crashes. In particular I tried to identify and restore recurring strings on the stack of the thread that crashes. Among those 71 crashes there were actually two different reasons for crashing:

  1. for 67 crashes: could not retreive random bytes for uuid: OS Error: 1073741947 which maps to NTSTATUS 0xC000007B STATUS_INVALID_IMAGE_FORMAT;
  2. for 4 crashes: could not retreive random bytes for uuid: OS Error: 1073742053 which maps to NTSTATUS 0xC00000E5 STATUS_INTERNAL_ERROR.

Crashes of type (1) appear to be failures to load the SysWOW64 variant of the bcryptprimitives.dll library from code that lives in bcrypt.dll. I recovered recurring partial strings \SysWOW64\bcryptpr and W64\bcryptprimitives.dll, which indicate that it is likely that an attempt to load this library occurred shortly before crashing. bcrypt!_LoadImage is reachable from bcrypt!BCryptGenRandom when used with BCRYPT_USE_SYSTEM_PREFERRED_RNG, and it contains code that looks as follows, which makes it the most likely location where this error originated from:

if (!LoadLibraryW(...)) {
   return GetLastError() == ERROR_MOD_NOT_FOUND ? STATUS_DLL_NOT_FOUND : STATUS_INVALID_IMAGE_FORMAT;
}

The problem with this code is that as users of BCryptGenRandom we do not have access to the original last error value but only to this STATUS_INVALID_IMAGE_FORMAT (and for some reason, there doesn't seem to be last errors attached to these crashes). Recovering the last error may give us a better explanation why we failed to load the library: we could try to load bcryptprimitives.dll manually within Firefox and catch the last error value if we want to explore this further. Note that this doesn't appear to be a sandboxing issue, as the crashes seem to be main process crashes.

Crashes of type (2) appear to be linked to a product from the InfoTeCS company (who name their products itcs-*). This product likely registers as a preferred RNG provider for the system but sometimes fails to generate numbers. I recovered the following partial strings from the stacks of these 4 crashes:
itcs-cng-provider 0xe2c itcsrng_msinterface.cpp:242 itccng::bcrypt::ItcsRng_MsInterface::GenRandom <- !!!FA;
itcs-cng-provider 0xba0 itcsrng_msinterface.cpp:242 itccng::bcrypt::ItcsRng_MsInterface::GenRandom <- !!!FA;
itcs-cng-provider 0x1368 itcsrng_msinterface.cpp:242 itccng::bcrypt::ItcsRng_MsInterface::GenRandom <- !!!F;
itcs-cng-provider 0xf04 itcsrng_msinterface.cpp:242 itccng::bcrypt::ItcsRng_MsInterface::GenRandom <- !!!FA.

Given that we have at least 2 different reasons why BCryptGenRandom can fail, it seems to me that having a fallback solution was a good idea.

(In reply to Yannis Juglaret from comment #9)

The problem with this code is that as users of BCryptGenRandom we do not have access to the original last error value but only to this STATUS_INVALID_IMAGE_FORMAT (and for some reason, there doesn't seem to be last errors attached to these crashes).

The last error value is held in the TEB and we only capture that in nightly. Given how useful this is I was thinking of extending this to beta and release. We don't want all crash reports to become bigger because of it, so maybe we could do it statistically, like stuff the minidump with extra data once every two beta crashes and once every twenty release ones. I'll file a bug for that.

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on beta

:chutten, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(chutten)
Keywords: topcrash

I'd love to be able to do more beyond incrementing the severity, but all I can do is add more background:

  • uuid was just upped to 1.0 in bug 1804359, but getrandom is still unconditionally using BCryptGenRandom in their latest, so I don't expect any improvement.
  • I've added links to a previous rustlang issue for this as well as getrandom's own issue for this fallback. Jan-Erik, it looks as though they are (or were as of October) amenable to adding a fallback. Does that change your impression of how such a patch might be received?
Severity: S3 → S2
Flags: needinfo?(chutten) → needinfo?(jrediger)

(In reply to Chris H-C :chutten from comment #12)

  • I've added links to a previous rustlang issue for this as well as getrandom's own issue for this fallback. Jan-Erik, it looks as though they are (or were as of October) amenable to adding a fallback. Does that change your impression of how such a patch might be received?

We sure can try

Flags: needinfo?(jrediger)

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash
Whiteboard: [win:stability]

(In reply to Jan-Erik Rediger [:janerik] from comment #13)

We sure can try

Thanks [:janerik] for working on this! I played with the patch, and while it does provide a fallback, the fallback itself also relies on successfully loading bcryptprimitives.dll, which seems to be what's failing for the vast majority of our crash reports (see comment 9). So I don't think this would be enough to fix our stability issue, except maybe for the few crashes related to InfoTeCS.

I think the stability issue can only be fixed by a fallback that doesn't rely on successful loading of bcryptprimitives.dll, so one that uses previous APIs such as ADVAPI32!RtlGenRandom like mentioned in comment 8. If you believe that getrandom won't agree to have this fallback, then maybe we should move on our own about this? For example, I can try to write a patch to hook BCryptGenRandom on Windows 7, and implement the RtlGenRandom-based fallback in that hook?

To reach this conclusion, I played with a test executable in a WinDbg, and simulated the scenario where trying to load bcryptprimitives.dll fails. In that case, both the original path and the fallback path end up failing, and rust ultimately panics with the same error as the one I recovered for most crashes in comment 9. Below are the two call stacks that lead to loading bcryptprimitives.dll, the first is in the original path, the second is in the fallback path.

...
// First call stack
0:000> k
 # Child-SP          RetAddr               Call Site
00 000000e8`53cfeff8 00007ff9`ba8f6ff5     KERNELBASE!LoadLibraryExW
01 000000e8`53cff000 00007ff9`ba8f6ad9     bcrypt!_LoadImage+0xa1
02 000000e8`53cff060 00007ff9`ba8f48c1     bcrypt!LoadProviderEx+0x189
03 000000e8`53cff0f0 00007ff9`ba8f18d6     bcrypt!BCryptOpenAlgorithmProvider+0x701
04 000000e8`53cff500 00007ff9`ba8f177c     bcrypt!OpenSystemPreferredAlgorithmProvider+0x9a
05 000000e8`53cff560 00007ff6`0e022a0e     bcrypt!BCryptGenRandom+0x1ac
06 000000e8`53cff5e0 00007ff6`0e02264e     bcrypt_exe!getrandom::imp::Rng::random+0x9e [C:\Users\yjugl\.cargo\git\checkouts\getrandom-4e87a219ad2fb861\6f67832\src\windows.rs @ 128] 
07 000000e8`53cff680 00007ff6`0e02217a     bcrypt_exe!getrandom::imp::getrandom_inner+0xee [C:\Users\yjugl\.cargo\git\checkouts\getrandom-4e87a219ad2fb861\6f67832\src\windows.rs @ 65] 
08 000000e8`53cff770 00007ff6`0e02221f     bcrypt_exe!getrandom::getrandom_uninit+0x7a [C:\Users\yjugl\.cargo\git\checkouts\getrandom-4e87a219ad2fb861\6f67832\src\lib.rs @ 335] 
09 000000e8`53cff800 00007ff6`0e021e1d     bcrypt_exe!getrandom::getrandom+0x3f [C:\Users\yjugl\.cargo\git\checkouts\getrandom-4e87a219ad2fb861\6f67832\src\lib.rs @ 307] 
0a 000000e8`53cff8a0 00007ff6`0e021488     bcrypt_exe!uuid::rng::bytes+0x2d [C:\repos\rust\bcrypt\uuid\src\rng.rs @ 7] 
0b 000000e8`53cff8e0 00007ff6`0e021011     bcrypt_exe!uuid::Uuid::new_v4+0x18 [C:\repos\rust\bcrypt\uuid\src\v4.rs @ 34] 
...
// First parameter
0:000> du rcx
000001e8`7ce0a700  "C:\WINDOWS\system32\bcryptprimit"
000001e8`7ce0a740  "ives.dll"
// Simulating a failure to load the library
0:000> r rip=poi(rsp); r rsp=rsp+8; r rax=0; g
...
// Second call stack
0:000> k
 # Child-SP          RetAddr               Call Site
00 000000e8`53cfefc8 00007ff9`ba8f6ff5     KERNELBASE!LoadLibraryExW
01 000000e8`53cfefd0 00007ff9`ba8f6ad9     bcrypt!_LoadImage+0xa1
02 000000e8`53cff030 00007ff9`ba8f48c1     bcrypt!LoadProviderEx+0x189
03 000000e8`53cff0c0 00007ff6`0e0227f3     bcrypt!BCryptOpenAlgorithmProvider+0x701
04 000000e8`53cff4d0 00007ff6`0e022a8b     bcrypt_exe!getrandom::imp::Rng::open+0xe3 [C:\Users\yjugl\.cargo\git\checkouts\getrandom-4e87a219ad2fb861\6f67832\src\windows.rs @ 100] 
05 000000e8`53cff5d0 00007ff6`0e022681     bcrypt_exe!getrandom::imp::fallback_rng+0x2b [C:\Users\yjugl\.cargo\git\checkouts\getrandom-4e87a219ad2fb861\6f67832\src\windows.rs @ 147] 
06 000000e8`53cff680 00007ff6`0e02217a     bcrypt_exe!getrandom::imp::getrandom_inner+0x121 [C:\Users\yjugl\.cargo\git\checkouts\getrandom-4e87a219ad2fb861\6f67832\src\windows.rs @ 66] 
07 000000e8`53cff770 00007ff6`0e02221f     bcrypt_exe!getrandom::getrandom_uninit+0x7a [C:\Users\yjugl\.cargo\git\checkouts\getrandom-4e87a219ad2fb861\6f67832\src\lib.rs @ 335] 
08 000000e8`53cff800 00007ff6`0e021e1d     bcrypt_exe!getrandom::getrandom+0x3f [C:\Users\yjugl\.cargo\git\checkouts\getrandom-4e87a219ad2fb861\6f67832\src\lib.rs @ 307] 
09 000000e8`53cff8a0 00007ff6`0e021488     bcrypt_exe!uuid::rng::bytes+0x2d [C:\repos\rust\bcrypt\uuid\src\rng.rs @ 7] 
0a 000000e8`53cff8e0 00007ff6`0e021011     bcrypt_exe!uuid::Uuid::new_v4+0x18 [C:\repos\rust\bcrypt\uuid\src\v4.rs @ 34] 
// First parameter
0:000> du rcx
000001e8`7ce0a700  "C:\WINDOWS\system32\bcryptprimit"
000001e8`7ce0a740  "ives.dll"
// Simulating a failure to load the library
0:000> r rip=poi(rsp); r rsp=rsp+8; r rax=0; g
...
// Result (this is NTSTATUS 0xc000007b)
thread 'main' panicked at 'could not retrieve random bytes for uuid: OS Error: 1073741947', uuid\src\rng.rs:9:13

This recent crash signature indeed suggests that Rust's "fallback" implementation doesn't cover the case where bcryptprimitives.dll fails to load, these crashes all have moz crash reason RNG broken: 0xc000007b, fallback RNG broken: 0xc000007b (see comment 9 for what 0xc000007b means). We need RtlGenRandom.

I filed an issue on Rust StdLib, and they have a PR up to fix this. When that merges, we should be good. (You know... In a few months when the new version of Rust is released)

See Also: → 1816953
Depends on: 1816953
See Also: 1816953
Crash Signature: [@ uuid::Uuid::new_v4] [@ uuid::v4::impl$0::new_v4::closure$0] → [@ uuid::Uuid::new_v4] [@ uuid::v4::impl$0::new_v4::closure$0] [@ std::sys::windows::rand::fallback_rng]

I'm adding the std::sys::windows::rand::fallback_rng signature here because the root cause is the same: calling BCryptGenRandom fails, usually because of failure to load bcryptprimitives.dll. So these signatures should be monitored together. In fact, the reason we do not see the uuid crash in 110 is not because it's fixed, but because these users crash earlier in std::sys::windows::rand::fallback_rng and never reach uuid code.

Important: These STR can leave your system in an unbootable and/or corrupt state. Please follow them in a virtual machine.

Important: These STR only apply to Windows 7, which is the main concern for this crash.

Steps to reproduce:

  • Take a snapshot of your Windows 7 VM, and boot it.
  • Install the Firefox version you want to test.
  • If testing Firefox 32-bit on Windows 64-bit, navigate to C:\Windows\SysWOW64.
    In all other cases, navigate to C:\Windows\System32 instead.
    In case of doubt, do the steps in both folders.
  • Right click on bcryptprimitives.dll, click Properties, then under Security click Advanced.
  • Under Owner, click Edit..., select your current user, click OK, click OK.
  • Close Advanced Security Settings.
  • Back in bcryptprimitives.dll Properties, under Security, click Edit..., select Users, click the Allow box next to Full control, click OK, click Yes.
  • Back in the file explorer, rename bcryptprimitives.dll to bcryptprimitivesnotfound.dll, click Continue.
  • Run Firefox.
  • After noticing the problem, restore your VM snapshot.

Problem:

  • Firefox refuses to start and crashes immediately (on Windows 7).

Expected:

  • Firefox starts (on Windows 7).
Has STR: --- → yes

Below is the stack trace for the first use of BCryptGenRandom in firefox.exe currently. We would need to setup a hook before that call.

00 000000ed`295ff688 00007ffb`d0d84f95     bcrypt!BCryptGenRandom
01 (Inline Function) --------`--------     xul!std::collections::hash::map::impl$82::new::KEYS::__getit::closure$0+0xb0 [/rustc/897e37553bba8b42751c67658967889d11ecd120/library\std\src\thread\local.rs @ 353] 
02 (Inline Function) --------`--------     xul!std::thread::local::lazy::LazyKeyInner::initialize+0xb0 [/rustc/897e37553bba8b42751c67658967889d11ecd120/library\std\src\thread\local.rs @ 809] 
03 000000ed`295ff690 00007ffb`d0d95a86     xul!std::thread::local::fast::Key::try_initialize<core::cell::Cell<tuple$<u64,u64> >,std::collections::hash::map::impl$82::new::KEYS::__getit::closure_env$0>+0xb5 [/rustc/897e37553bba8b42751c67658967889d11ecd120/library\std\src\thread\local.rs @ 987] 
04 (Inline Function) --------`--------     xul!std::thread::local::fast::Key::get+0x22 [/rustc/897e37553bba8b42751c67658967889d11ecd120/library\std\src\thread\local.rs @ 970] 
05 000000ed`295ff6d0 00007ffb`d1ed9fdf     xul!std::collections::hash::map::impl$82::new::KEYS::__getit+0x26 [/rustc/897e37553bba8b42751c67658967889d11ecd120/library\std\src\thread\local.rs @ 356] 
06 (Inline Function) --------`--------     xul!std::thread::local::LocalKey<core::cell::Cell<tuple$<u64,u64> > >::try_with+0x7 [/rustc/897e37553bba8b42751c67658967889d11ecd120\library\std\src\thread\local.rs @ 444] 
07 (Inline Function) --------`--------     xul!std::thread::local::LocalKey<core::cell::Cell<tuple$<u64,u64> > >::with+0x7 [/rustc/897e37553bba8b42751c67658967889d11ecd120\library\std\src\thread\local.rs @ 421] 
08 (Inline Function) --------`--------     xul!std::collections::hash::map::RandomState::new+0x7 [/rustc/897e37553bba8b42751c67658967889d11ecd120\library\std\src\collections\hash\map.rs @ 3129] 
09 (Inline Function) --------`--------     xul!std::collections::hash::map::impl$87::default+0x7 [/rustc/897e37553bba8b42751c67658967889d11ecd120\library\std\src\collections\hash\map.rs @ 3209] 
0a (Inline Function) --------`--------     xul!std::collections::hash::map::impl$8::default+0x7 [/rustc/897e37553bba8b42751c67658967889d11ecd120\library\std\src\collections\hash\map.rs @ 1320] 
0b (Inline Function) --------`--------     xul!std::collections::hash::map::HashMap<enum2$<core::option::Option<alloc::string::String> >,log::LevelFilter,std::collections::hash::map::RandomState>::new+0x7 [/rustc/897e37553bba8b42751c67658967889d11ecd120\library\std\src\collections\hash\map.rs @ 235] 
0c (Inline Function) --------`--------     xul!env_logger::filter::Builder::new+0x7 [/builds/worker/checkouts/gecko/third_party/rust/env_logger/src/filter/mod.rs @ 175] 
0d (Inline Function) --------`--------     xul!env_logger::filter::impl$2::default+0x7 [/builds/worker/checkouts/gecko/third_party/rust/env_logger/src/filter/mod.rs @ 264] 
0e (Inline Function) --------`--------     xul!env_logger::impl$9::default+0x7 [/builds/worker/checkouts/gecko/third_party/rust/env_logger/src/lib.rs @ 382] 
0f 000000ed`295ff700 00007ffb`d1c75247     xul!env_logger::Builder::new+0xf [/builds/worker/checkouts/gecko/third_party/rust/env_logger/src/lib.rs @ 416] 
10 (Inline Function) --------`--------     xul!gecko_logger::GeckoLogger::new+0x5 [/builds/worker/checkouts/gecko/xpcom/rust/gecko_logger/src/lib.rs @ 164] 
11 000000ed`295ff740 00007ffb`d0871510     xul!gecko_logger::GeckoLogger::init+0x17 [/builds/worker/checkouts/gecko/xpcom/rust/gecko_logger/src/lib.rs @ 179] 
12 (Inline Function) --------`--------     xul!gkrust_shared::GkRust_Init+0x5 [/builds/worker/checkouts/gecko/toolkit/library/rust/shared/lib.rs @ 141] 
13 000000ed`295ff9e0 00007ffb`d4f83b1d     xul!NS_InitXPCOM+0x60 [/builds/worker/checkouts/gecko/xpcom/build/XPCOMInit.cpp @ 256] 
14 (Inline Function) --------`--------     xul!ScopedXPCOMStartup::Initialize+0x1f [/builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp @ 2076] 
15 000000ed`295ffb00 00007ffb`d28bfa82     xul!XREMain::XRE_main+0x2ed [/builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp @ 5969] 
16 000000ed`295ffbc0 00007ff7`dd00e9ae     xul!XRE_main+0x62 [/builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp @ 6029] 
17 (Inline Function) --------`--------     firefox!do_main+0xcc [/builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp @ 226] 
18 (Inline Function) --------`--------     firefox!NS_internal_main+0x364 [/builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp @ 430] 
19 000000ed`295ffcc0 00007ff7`dd01cad8     firefox!wmain+0x5fe [/builds/worker/checkouts/gecko/toolkit/xre/nsWindowsWMain.cpp @ 167] 
1a (Inline Function) --------`--------     firefox!invoke_main+0x22 [d:\agent\_work\2\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 90] 
1b 000000ed`295ffef0 00007ffc`81ef26bd     firefox!__scrt_common_main_seh+0x10c [d:\agent\_work\2\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 288] 
1c 000000ed`295fff30 00007ffc`8384dfb8     KERNEL32!BaseThreadInitThunk+0x1d
1d 000000ed`295fff60 00000000`00000000     ntdll!RtlUserThreadStart+0x28
Crash Signature: [@ uuid::Uuid::new_v4] [@ uuid::v4::impl$0::new_v4::closure$0] [@ std::sys::windows::rand::fallback_rng] → [@ uuid::Uuid::new_v4] [@ uuid::v4::impl$0::new_v4::closure$0] [@ std::sys::windows::rand::fallback_rng] [@ uuid::rng::bytes::closure$0]
Assignee: nobody → yjuglaret
Attachment #9319295 - Attachment description: WIP: Bug 1788004 - Implement a hook-based fallback for BCryptGenRandom to mitigate Rust panics. r=cmartin,handyman → Bug 1788004 - Implement a hook-based fallback for BCryptGenRandom to mitigate Rust panics. r=cmartin,handyman
Status: NEW → ASSIGNED
See Also: → 1818700
Pushed by yjuglaret@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/78e7281703a2
Implement a hook-based fallback for BCryptGenRandom to mitigate Rust panics. r=cmartin
Pushed by yjuglaret@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a5eb0191c293
Implement a hook-based fallback for BCryptGenRandom to mitigate Rust panics. r=cmartin
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 112 Branch

Not sure if this is something wanted to try to uplift to Beta or not, but it does graft cleanly if we did.

This is now fixed in Nightly:

The vast majority of crashes are with 32-bit builds, and the string recovery in the investigation from comment 9 more precisely suggests that this a majority of 32-bit builds running on a 64-bit system. This correlation could be explained by the fact that the Windows system itself can maybe cope with failing to load 32-bit bcryptprimitives.dll on a 64-bit system, but not with other variations (i.e. corrupt 64-bit DLL on 64-bit system, or corrupt 32-bit DLL on 32-bit system) where it may refuse to boot. Note that although failing to load bcryptprimitives.dll seems to be the root cause for the vast majority of crashes, BCryptGenRandom can fail for other reasons, and the patch should fix those situations as well provided that BCryptGenRandom fails consistently. For the situations where BCryptGenRandom may fail intermittently, those will only be fixed properly by bug 1816953.

For the hook to take place with 64-bit builds on the versions which I tested, I had to manually add support for some specific instructions in our hooking code. Those instructions were in the 13 first bytes for BCryptGenRandom on my Windows 7 and Windows 11 64-bit variants of bcrypt.dll. If we still see crashes with 64-builds after the patch, that could be explained either by BCryptGenRandom consistent failures and non-supported instructions in our hooking code, and in that case we could look for the 13 first bytes of the bcrypt.dll version found in the crashes and implement them, or by BCryptGenRandom intermittent failures, since we only try to hook if the first call to BCryptGenRandom fails. This is not a problem for 32-bit builds, where BCryptGenRandom has a hot-patch point, so we can apply a hooking strategy that does not require us to implement specific code for each possible variation of the first bytes of the function.

Comment on attachment 9319295 [details]
Bug 1788004 - Implement a hook-based fallback for BCryptGenRandom to mitigate Rust panics. r=cmartin,handyman

Beta/Release Uplift Approval Request

  • User impact if declined: Firefox refuses to start and crashes at launch.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: The STR are listed in comment 19.
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The patch will only change behavior for the users impacted by this crash. More precisely, we only try to hook BCryptGenRandom if that function fails consistently, which is precisely the situation that leads to this crash. We won't try to add a hook on machines where BCryptGenRandom seems to work normally, so the patch should not impact other users.
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9319295 - Flags: approval-mozilla-release?
Attachment #9319295 - Flags: approval-mozilla-beta?
Flags: qe-verify+

Comment on attachment 9319295 [details]
Bug 1788004 - Implement a hook-based fallback for BCryptGenRandom to mitigate Rust panics. r=cmartin,handyman

Approved for 111.0b8
Rejecting release uplift request, next week is RC week for 111 and there's no 110 dot release on the radar before then.

Attachment #9319295 - Flags: approval-mozilla-release?
Attachment #9319295 - Flags: approval-mozilla-release-
Attachment #9319295 - Flags: approval-mozilla-beta?
Attachment #9319295 - Flags: approval-mozilla-beta+
QA Whiteboard: [qa-triaged]

I've reproduced the crash using steps from comment 19 using Firefox 110.0 RC on both Windows 7 32 and 64bit VM:

I verified that using latest builds Latest Nightly 112.0a1 from 2023-03-02 and Firefox Beta 110.0b8 the crashes no longer occur on both Windows 7 32 and 64bit VM.

Status: RESOLVED → VERIFIED
Flags: qe-verify+

Firefox Beta 110.0b8 the crashes no longer occur on both Windows 7 32 and 64bit VM.

I think this would have been Beta 111.0b8.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: