Closed Bug 1358151 Opened 7 years ago Closed 7 years ago

Race condition in rust runtime TLS initialization

Categories

(Core :: General, defect)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: ccorcoran, Assigned: ccorcoran)

References

Details

Attachments

(1 file)

Contention between an internal Rust TLS mutex and the Windows system loader lock causes a potential deadlock.

This race condition seems to be almost always hidden, though we have seen a couple scenarios recently that revealed it, for example see bug 1322554 comment 125 which got it repro'ing over 50% on certain configurations. Windows 8 64-bit, debug build of Firefox, with this patch which has been backed out: https://hg.mozilla.org/releases/mozilla-aurora/rev/e82a6f21f043)

From that example, here's a stack trace of 2 relevant threads while firefox is frozen:

> .  0  Id: 2ac4.24fc Suspend: 1 Teb: 00007ff6`db0fd000 Unfrozen
>  # Child-SP          RetAddr           Call Site
> 00 00000052`253ae8b8 00007ffb`b648ddb9 ntdll!NtWaitForSingleObject+0xa
> 01 00000052`253ae8c0 00007ffb`b648b784 ntdll!RtlpWaitOnCriticalSection+0xe1
> 02 00000052`253ae990 00007ffb`b648b67c ntdll!RtlpEnterCriticalSectionContended+0xa4
> 03 00000052`253ae9d0 00007ffb`b648f63e ntdll!LdrpAcquireLoaderLock+0x2c
> 04 00000052`253aea10 00007ffb`b36d4250 ntdll!LdrGetProcedureAddressForCaller+0x19e
> 05 00000052`253aec00 00007ffb`74daa101 KERNELBASE!GetProcAddressForCaller+0x70
> 06 00000052`253aec50 00007ffb`74daf03a xul!std::sys::imp::compat::lookup+0x101 [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys\windows\compat.rs @ 34]
> 07 00000052`253aed60 00007ffb`74da6aec xul!std::sys::imp::mutex::Mutex::unlock+0x7a [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys\windows\mutex.rs @ 97]
> 08 00000052`253aedb0 00007ffb`74db6053 xul!std::sys_common::at_exit_imp::push+0xbc [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys_common\at_exit_imp.rs @ 80]
> 09 (Inline Function) --------`-------- xul!std::sys_common::at_exit+0x11 [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys_common\mod.rs @ 94]
> 0a 00000052`253aee10 00007ffb`74db5d22 xul!std::sys::imp::thread_local::init_dtors+0x63 [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys\windows\thread_local.rs @ 132]
> 0b 00000052`253aee60 00007ffb`74da7768 xul!std::sys::imp::thread_local::create+0x32 [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys\windows\thread_local.rs @ 70]
> 0c 00000052`253aeeb0 00007ffb`74dbfdb4 xul!std::sys_common::thread_local::StaticKey::lazy_init+0x18 [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys_common\thread_local.rs @ 180]
> 0d 00000052`253aeef0 00007ffb`74db7b06 xul!std::panicking::update_panic_count::PANIC_COUNT::__getit+0x24 [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\thread\local.rs @ 163]
> 0e 00000052`253aef30 00007ffb`79b91151 xul!std::panicking::set_hook+0x26 [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\panicking.rs @ 102]
> 0f 00000052`253aefc0 00007ffb`79b7004c xul!CrashReporter::SetExceptionHandler+0x5d [d:\mozilla-central\toolkit\crashreporter\nsexceptionhandler.cpp @ 1610]
> 10 00000052`253af200 00007ffb`79b6f0cb xul!XREMain::XRE_mainInit+0x5b0 [d:\mozilla-central\toolkit\xre\nsapprunner.cpp @ 3256]
> 11 00000052`253af6b0 00007ffb`79b6e9e4 xul!XREMain::XRE_main+0x68f [d:\mozilla-central\toolkit\xre\nsapprunner.cpp @ 4697]
> 12 00000052`253af760 00007ff6`dbed58d6 xul!XRE_main+0x48 [d:\mozilla-central\toolkit\xre\nsapprunner.cpp @ 4810]
> 13 00000052`253af910 00007ff6`dbed523a firefox!do_main+0x212 [d:\mozilla-central\browser\app\nsbrowserapp.cpp @ 237]
> 14 00000052`253afad0 00007ff6`dbed5dc3 firefox!NS_internal_main+0x13e [d:\mozilla-central\browser\app\nsbrowserapp.cpp @ 309]
> 15 00000052`253afb40 00007ff6`dbf2bff9 firefox!wmain+0x163 [d:\mozilla-central\toolkit\xre\nswindowswmain.cpp @ 118]
> 16 (Inline Function) --------`-------- firefox!invoke_main+0x22 [f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl @ 79]
> 17 00000052`253afba0 00007ffb`b5bf13d2 firefox!__scrt_common_main_seh+0x11d [f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl @ 253]
> 18 00000052`253afbe0 00007ffb`b64654e4 KERNEL32!BaseThreadInitThunk+0x22
> 19 00000052`253afc10 00000000`00000000 ntdll!RtlUserThreadStart+0x34
> 
>    2  Id: 2ac4.2314 Suspend: 1 Teb: 00007ff6`db0f9000 Unfrozen
>  # Child-SP          RetAddr           Call Site
> 00 00000052`27a0f4a8 00007ffb`b648ac5e ntdll!NtWaitForAlertByThreadId+0xa
> 01 00000052`27a0f4b0 00007ffb`74db61ab ntdll!RtlAcquireSRWLockExclusive+0x11e
> 02 00000052`27a0f520 00007ffb`b646c0f4 xul!std::sys::imp::thread_local::on_tls_callback+0x4b [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys\windows\thread_local.rs @ 243]
> 03 00000052`27a0f5d0 00007ffb`b646cfd8 ntdll!LdrpCallInitRoutine+0x4c
> 04 00000052`27a0f630 00007ffb`b646bb03 ntdll!LdrpCallTlsInitializers+0x88
> 05 00000052`27a0f690 00007ffb`b64685de ntdll!LdrShutdownThread+0x103
> 06 00000052`27a0f780 00007ffb`b5bf13da ntdll!RtlExitUserThread+0x3e
> 07 00000052`27a0f7c0 00007ffb`b64654e4 KERNEL32!BaseThreadInitThunk+0x2a
> 08 00000052`27a0f7f0 00000000`00000000 ntdll!RtlUserThreadStart+0x34
> 

And here's what I believe is happening:

> Main thread
> {
> 	rust begins lazy-initializing tls system
> 	ACQUIRES rust tls mutex
> 	trying to release that mutex, calls into GetProcAddress
> 	WAITS for loader lock
> }
> 
> Thread1 (InitDwrite)
> {
> 	...completes its work, exits
> 	ACQUIRES loader lock
> 	on_tls_callback
> 	WAITS for rust tls mutex
> }


on_tls_callback should not be acquiring any locks, as it's currently holding the system loader lock.

With regards to Firefox, before this is fixed in the Rust runtime, one way I could imagine preventing this deadlock is by ensuring TLS is fully initialized before on_tls_callback can be invoked.
Blocks: 1322554
Severity: normal → critical
I'd love if we could get some fire under this; it blocks a patch for bug 1322554 which is a critical top crasher.
Oh dear sounds like a bad bug! I had no idea that when on_tls_callback was invoked the system held locks that we'd have to avoid...

I'll start investigating a fix on the Rust side. Your sequence of events makes sense to me.
Also if anyone knows a better way to implement TLS destructors in general, I'd love to hear about it! AFAIK the only way to register a destructor with a TLS key in Windows is through this mechanism. If locks are held when on_tls_callback is invoked or if its generally a super constrained environment, then that's... not good!
I've posted a fix for this against rust-lang/rust: https://github.com/rust-lang/rust/pull/41512
Alex, thank you for the quick solution! Is there a workaround we could on our end before your patch lands our side? For example, how could we force a lazy init of TLS to avoid the race condition?

Ted, do you know when this patch will propagate to us?
Flags: needinfo?(ted)
Flags: needinfo?(acrichton)
Is there a deterministic point where you can run a bit of Rust code before any threads are spawned? If possible you could call into a piece of "noop" rust code like:


thread_local!(static FOO: () = ());

#[no_mangle]
pub fn rust_init_please_remove_this_after_updating_rust() {
    FOO.with(|_| ());
}
Flags: needinfo?(acrichton)
(In reply to Carl Corcoran [:ccorcoran] from comment #5)
> Ted, do you know when this patch will propagate to us?

That patch just landed on Rust nightly (1.19), so even if we aggressively update it'll take a while.
Flags: needinfo?(ted)
Comment on attachment 8865105 [details]
bug 1358151: temporary workaround for rust race condition

https://reviewboard.mozilla.org/r/136780/#review140202

::: toolkit/library/rust/shared/lib.rs:27
(Diff revision 1)
> +
> +
> +// For details, see bug 1358151
> +thread_local!(static UNUSED_THREAD_LOCAL: () = ());
> +#[no_mangle]
> +pub extern "C" fn rust_init_please_remove_this_after_updating_rust() {

A comment mentioning that the fix for the issue this is working around is in Rust 1.19 would be good.
Attachment #8865105 - Flags: review+
Keywords: checkin-needed
Assignee: nobody → ccorcoran
This was successfully autolanded, but the bug marking didn't happen for some reason.

https://hg.mozilla.org/integration/autoland/rev/3bc93cae8482
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/3bc93cae8482
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Apparently we had been seeing this in automation--bug 1362901 describes Marionette test hangs inside SetExceptionHandler.
Blocks: 1405105
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: