Closed Bug 1358151 Opened 8 years ago Closed 8 years ago

Race condition in rust runtime TLS initialization

Categories

(Core :: General, defect)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: ccorcoran, Assigned: ccorcoran)

References

Details

Attachments

(1 file)

Contention between an internal Rust TLS mutex and the Windows system loader lock causes a potential deadlock. This race condition seems to be almost always hidden, though we have seen a couple scenarios recently that revealed it, for example see bug 1322554 comment 125 which got it repro'ing over 50% on certain configurations. Windows 8 64-bit, debug build of Firefox, with this patch which has been backed out: https://hg.mozilla.org/releases/mozilla-aurora/rev/e82a6f21f043) From that example, here's a stack trace of 2 relevant threads while firefox is frozen: > . 0 Id: 2ac4.24fc Suspend: 1 Teb: 00007ff6`db0fd000 Unfrozen > # Child-SP RetAddr Call Site > 00 00000052`253ae8b8 00007ffb`b648ddb9 ntdll!NtWaitForSingleObject+0xa > 01 00000052`253ae8c0 00007ffb`b648b784 ntdll!RtlpWaitOnCriticalSection+0xe1 > 02 00000052`253ae990 00007ffb`b648b67c ntdll!RtlpEnterCriticalSectionContended+0xa4 > 03 00000052`253ae9d0 00007ffb`b648f63e ntdll!LdrpAcquireLoaderLock+0x2c > 04 00000052`253aea10 00007ffb`b36d4250 ntdll!LdrGetProcedureAddressForCaller+0x19e > 05 00000052`253aec00 00007ffb`74daa101 KERNELBASE!GetProcAddressForCaller+0x70 > 06 00000052`253aec50 00007ffb`74daf03a xul!std::sys::imp::compat::lookup+0x101 [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys\windows\compat.rs @ 34] > 07 00000052`253aed60 00007ffb`74da6aec xul!std::sys::imp::mutex::Mutex::unlock+0x7a [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys\windows\mutex.rs @ 97] > 08 00000052`253aedb0 00007ffb`74db6053 xul!std::sys_common::at_exit_imp::push+0xbc [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys_common\at_exit_imp.rs @ 80] > 09 (Inline Function) --------`-------- xul!std::sys_common::at_exit+0x11 [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys_common\mod.rs @ 94] > 0a 00000052`253aee10 00007ffb`74db5d22 xul!std::sys::imp::thread_local::init_dtors+0x63 [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys\windows\thread_local.rs @ 132] > 0b 00000052`253aee60 00007ffb`74da7768 xul!std::sys::imp::thread_local::create+0x32 [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys\windows\thread_local.rs @ 70] > 0c 00000052`253aeeb0 00007ffb`74dbfdb4 xul!std::sys_common::thread_local::StaticKey::lazy_init+0x18 [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys_common\thread_local.rs @ 180] > 0d 00000052`253aeef0 00007ffb`74db7b06 xul!std::panicking::update_panic_count::PANIC_COUNT::__getit+0x24 [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\thread\local.rs @ 163] > 0e 00000052`253aef30 00007ffb`79b91151 xul!std::panicking::set_hook+0x26 [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\panicking.rs @ 102] > 0f 00000052`253aefc0 00007ffb`79b7004c xul!CrashReporter::SetExceptionHandler+0x5d [d:\mozilla-central\toolkit\crashreporter\nsexceptionhandler.cpp @ 1610] > 10 00000052`253af200 00007ffb`79b6f0cb xul!XREMain::XRE_mainInit+0x5b0 [d:\mozilla-central\toolkit\xre\nsapprunner.cpp @ 3256] > 11 00000052`253af6b0 00007ffb`79b6e9e4 xul!XREMain::XRE_main+0x68f [d:\mozilla-central\toolkit\xre\nsapprunner.cpp @ 4697] > 12 00000052`253af760 00007ff6`dbed58d6 xul!XRE_main+0x48 [d:\mozilla-central\toolkit\xre\nsapprunner.cpp @ 4810] > 13 00000052`253af910 00007ff6`dbed523a firefox!do_main+0x212 [d:\mozilla-central\browser\app\nsbrowserapp.cpp @ 237] > 14 00000052`253afad0 00007ff6`dbed5dc3 firefox!NS_internal_main+0x13e [d:\mozilla-central\browser\app\nsbrowserapp.cpp @ 309] > 15 00000052`253afb40 00007ff6`dbf2bff9 firefox!wmain+0x163 [d:\mozilla-central\toolkit\xre\nswindowswmain.cpp @ 118] > 16 (Inline Function) --------`-------- firefox!invoke_main+0x22 [f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl @ 79] > 17 00000052`253afba0 00007ffb`b5bf13d2 firefox!__scrt_common_main_seh+0x11d [f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl @ 253] > 18 00000052`253afbe0 00007ffb`b64654e4 KERNEL32!BaseThreadInitThunk+0x22 > 19 00000052`253afc10 00000000`00000000 ntdll!RtlUserThreadStart+0x34 > > 2 Id: 2ac4.2314 Suspend: 1 Teb: 00007ff6`db0f9000 Unfrozen > # Child-SP RetAddr Call Site > 00 00000052`27a0f4a8 00007ffb`b648ac5e ntdll!NtWaitForAlertByThreadId+0xa > 01 00000052`27a0f4b0 00007ffb`74db61ab ntdll!RtlAcquireSRWLockExclusive+0x11e > 02 00000052`27a0f520 00007ffb`b646c0f4 xul!std::sys::imp::thread_local::on_tls_callback+0x4b [C:\bot\slave\stable-dist-rustc-win-msvc-64\build\src\libstd\sys\windows\thread_local.rs @ 243] > 03 00000052`27a0f5d0 00007ffb`b646cfd8 ntdll!LdrpCallInitRoutine+0x4c > 04 00000052`27a0f630 00007ffb`b646bb03 ntdll!LdrpCallTlsInitializers+0x88 > 05 00000052`27a0f690 00007ffb`b64685de ntdll!LdrShutdownThread+0x103 > 06 00000052`27a0f780 00007ffb`b5bf13da ntdll!RtlExitUserThread+0x3e > 07 00000052`27a0f7c0 00007ffb`b64654e4 KERNEL32!BaseThreadInitThunk+0x2a > 08 00000052`27a0f7f0 00000000`00000000 ntdll!RtlUserThreadStart+0x34 > And here's what I believe is happening: > Main thread > { > rust begins lazy-initializing tls system > ACQUIRES rust tls mutex > trying to release that mutex, calls into GetProcAddress > WAITS for loader lock > } > > Thread1 (InitDwrite) > { > ...completes its work, exits > ACQUIRES loader lock > on_tls_callback > WAITS for rust tls mutex > } on_tls_callback should not be acquiring any locks, as it's currently holding the system loader lock. With regards to Firefox, before this is fixed in the Rust runtime, one way I could imagine preventing this deadlock is by ensuring TLS is fully initialized before on_tls_callback can be invoked.
Blocks: 1322554
Severity: normal → critical
I'd love if we could get some fire under this; it blocks a patch for bug 1322554 which is a critical top crasher.
Oh dear sounds like a bad bug! I had no idea that when on_tls_callback was invoked the system held locks that we'd have to avoid... I'll start investigating a fix on the Rust side. Your sequence of events makes sense to me.
Also if anyone knows a better way to implement TLS destructors in general, I'd love to hear about it! AFAIK the only way to register a destructor with a TLS key in Windows is through this mechanism. If locks are held when on_tls_callback is invoked or if its generally a super constrained environment, then that's... not good!
I've posted a fix for this against rust-lang/rust: https://github.com/rust-lang/rust/pull/41512
Alex, thank you for the quick solution! Is there a workaround we could on our end before your patch lands our side? For example, how could we force a lazy init of TLS to avoid the race condition? Ted, do you know when this patch will propagate to us?
Flags: needinfo?(ted)
Flags: needinfo?(acrichton)
Is there a deterministic point where you can run a bit of Rust code before any threads are spawned? If possible you could call into a piece of "noop" rust code like: thread_local!(static FOO: () = ()); #[no_mangle] pub fn rust_init_please_remove_this_after_updating_rust() { FOO.with(|_| ()); }
Flags: needinfo?(acrichton)
(In reply to Carl Corcoran [:ccorcoran] from comment #5) > Ted, do you know when this patch will propagate to us? That patch just landed on Rust nightly (1.19), so even if we aggressively update it'll take a while.
Flags: needinfo?(ted)
Comment on attachment 8865105 [details] bug 1358151: temporary workaround for rust race condition https://reviewboard.mozilla.org/r/136780/#review140202 ::: toolkit/library/rust/shared/lib.rs:27 (Diff revision 1) > + > + > +// For details, see bug 1358151 > +thread_local!(static UNUSED_THREAD_LOCAL: () = ()); > +#[no_mangle] > +pub extern "C" fn rust_init_please_remove_this_after_updating_rust() { A comment mentioning that the fix for the issue this is working around is in Rust 1.19 would be good.
Attachment #8865105 - Flags: review+
Keywords: checkin-needed
Assignee: nobody → ccorcoran
This was successfully autolanded, but the bug marking didn't happen for some reason. https://hg.mozilla.org/integration/autoland/rev/3bc93cae8482
Keywords: checkin-needed
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Apparently we had been seeing this in automation--bug 1362901 describes Marionette test hangs inside SetExceptionHandler.
Blocks: 1405105
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: