stylo: Crash in rand::weak_rng

VERIFIED FIXED in Firefox 57

Status

()

defect
P3
critical
VERIFIED FIXED
2 years ago
2 years ago

People

(Reporter: philipp, Assigned: manishearth)

Tracking

(Blocks 1 bug, {crash, regression})

57 Branch
mozilla58
All
Windows
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox56 unaffected, firefox57+ verified, firefox58 verified)

Details

(crash signature)

Attachments

(2 attachments, 1 obsolete attachment)

Reporter

Description

2 years ago
This bug was filed from the Socorro interface and is 
report bp-ed6d1e84-c4e2-402e-b778-05a390171017.
=============================================================
Crashing Thread (29)
Frame 	Module 	Signature 	Source
0 	xul.dll 	std::panicking::rust_panic_with_hook 	src/libstd/panicking.rs:555
1 	xul.dll 	std::panicking::begin_panic<collections::string::String> 	src/libstd/panicking.rs:511
2 	xul.dll 	std::panicking::begin_panic_fmt 	src/libstd/panicking.rs:495
3 	xul.dll 	rand::weak_rng 	third_party/rust/rand/src/lib.rs:856
4 	xul.dll 	std::sys_common::backtrace::__rust_begin_short_backtrace<closure, ()> 	src/libstd/sys_common/backtrace.rs:133
5 	xul.dll 	alloc::boxed::{{impl}}::call_box<(), closure> 	src/liballoc/boxed.rs:647
6 	xul.dll 	std::sys::imp::thread::{{impl}}::new::thread_start 	src/libstd/sys/windows/thread.rs:50
7 	kernel32.dll 	BaseThreadInitThunk 	
8 	ntdll.dll 	RtlUserThreadStart

this rust panic signature in content processes is regressing in larger numbers since 57.0b6 and 58.0a1 build 20171006220306.
bug fixes that went into b6 were: https://mzl.la/2fPo1by

the crashes seem to hit particular installations repeatedly and come with various moz crash reasons:
https://crash-stats.mozilla.com/search/?signature=%3Drand%3A%3Aweak_rng&date=%3E%3D2017-10-01T00%3A00%3A00.000Z&_facets=moz_crash_reason#facet-moz_crash_reason
Manish, given your experience debugging the hashmap RNG failures, any thoughts on these weak_rng() panics after OsRng::new() fails? The errors affect all Windows versions (plus a few OS X 10.12.x users), but 94% of the crash reports are from Windows 7.

Perhaps the Rust rand library should work around these RNG failures in the Windows implementation of OsRng::new(). That would fix the whole Rust ecosystem and we wouldn't need to play Whac-A-Mole with RNG panics in Firefox. :)
Crash Signature: [@ rand::weak_rng] → [@ mozalloc_abort | abort | rand::weak_rng] [@ rand::weak_rng]
Flags: needinfo?(manishearth)
Priority: -- → P3
This is the same problem, OsRng sometimes gets funky failures from Windows, as far as I can tell due to broken DLLs.

We fixed the last one by removing uses of rng from hashmaps. This can't be fixed the same way; this is part of the rust runtime (and the fix would need to be in the stdlib's copy of the rand crate)
Flags: needinfo?(manishearth)
(In reply to Manish Goregaokar [:manishearth] from comment #2)
> We fixed the last one by removing uses of rng from hashmaps. This can't be
> fixed the same way; this is part of the rust runtime (and the fix would need
> to be in the stdlib's copy of the rand crate)

Yes, I'm suggesting we fix this upstream in Rust's stdlib.
I don't see how this can be fixed, though. The OS's RNG returned a weird error code; what *can* we do?
Also, this isn't actually a runtime bug, I realized that the backtrace indicates that we had a weak_rng failure in a thread we spawned.

Looking at callers of weak_rng apparently Rayon threads initialize an RNG
weak_rng could silently fall back to a weaker seed value like the current time if OsRng fails. weak_rng doesn't make any promises about being cryptographically secure.
Alex, as an owner of the "rand" crate, what do you think about changing weak_rng to silently fall back to a weaker seed value (like the current time + whatever) instead of panicking when OsRng::new() returns an error?

Crashing Firefox because weak_rng has a weak seed seems pretty extreme. These crashes currently only affect the Firefox content process, but they will affect the main browser process once we start using Stylo for XUL/chrome documents.

This panic is currently Windows content process top crash #10 in Beta 57. It's only #63 in Nightly 58, probably because fewer Nightly users have broken Windows 7 systems.

Besides Windows' CryptAcquireContext returning weird errors, we've hit OS RNG failures in Firefox for Android trying to open("/dev/urandom") because the OS ran out of fds in automation (bug 1140806).
Flags: needinfo?(acrichton)
Looking over this again. It seems like in any case you're considering upgrading and/or updating the `rand` crate, right? If we were to implement a fix, that is, you'd presumably update. Looking at the crash report you're currently using 0.3.15 of the `rand` crate, and in the meantime we've since published up to 0.3.17 for the `rand` crate.

The updates notably include https://github.com/rust-lang-nursery/rand/issues/111 which switches the implementation of OS randomness on Windows to a private API in Windows itself. On that issue it was also indicated that it's the same strategy Gecko takes I believe?

Basically what I'm thinking is that if you're thinking of updating anyway to fix this, maybe it's worth updating to 0.3.17 to pull in that fix and see what happens? That may end up transitively fixing https://github.com/rust-lang/rust/issues/44911 which would I think solve this issue, right?

Failing that it seems fine to me to fall back to some sort of a weaker seed for a "weak rng", it is, after all, weak!
Flags: needinfo?(acrichton)
Manish, can you try updating Nightly 58 from `rand` crate version 0.3.15 to 0.3.17 to see if https://github.com/rust-lang-nursery/rand/issues/111 avoids this crash?

This crash signature is content process top crash #8 in Beta 57, which is pretty high. It would be good to address this somehow, though I doubt we would want to uplift the `rand` 0.3.17 crate this late in Beta 57.

For some reason, this signature jumped in Beta 57 57.0b6 (2017-10-03) and continues to get worse:

57.0b3 = 3 reports
57.0b4 = 1
57.0b5 = 1
57.0b6 = 117
57.0b7 = 188
57.0b8 = 296
57.0b9 = 369

The pushlog between 57.0b5 and 57.0b6 includes your FnvHashMap fix (bug 1385971). Maybe we just swapped the hash::map::{{impl}}::new::KEYS::__init crashes for weak_rng for these sad Windows 7 users?

https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_57_0b5_RELEASE&tochange=FIREFOX_57_0b6_RELEASE
Flags: needinfo?(manishearth)
[Tracking Requested - why for this release]:

This Stylo crash signature is content process top crash #8 in Beta 57.

This crash spiked in 57.0b6 after bug 1385971 fixed a similar crash (thus allowing those crashing users to hit this new crash instead).
Comment hidden (mozreview-request)
Lots of .cargo-ok changes... Try using cargo-vendor 0.1.11? Or we should make some progress on bug 1403407...
Wait, aren't this crashes inside rust stdlib support? How does updating the crate help?
(In reply to Emilio Cobos Álvarez [:emilio] from comment #15)
> Wait, aren't this crashes inside rust stdlib support? How does updating the
> crate help?

err, s/support/code/, but my question still stands, this updates the crate in the rust libs, but not in the stdlib, right? (Though it may be the case that I'm missing something).
They're not. I thought they were.

They're in the Rayon thread initialization code. I looked at the stacks initially and thought they were in the Rust runtime but on looking closer the rust runtime does not rely on weak_rng.
Assignee

Updated

2 years ago
Assignee: nobody → manishearth

Comment 18

2 years ago
mozreview-review
Comment on attachment 8920726 [details]
Bug 1409444 - stylo: Update rand to 0.3.17;

https://reviewboard.mozilla.org/r/191740/#review197030

This seems to be updating lots of unrelated things... Please just update rand (and maybe its dependencies).
Attachment #8920726 - Flags: review?(xidorn+moz)
That is what I did. cargo update -p does the minimum possible update.
Comment hidden (mozreview-request)
This should be the minimum possible update.
Assignee

Comment 22

2 years ago
mozreview-review
Comment on attachment 8920947 [details]
Bug 1409444 - Update rand to 0.3.17.

https://reviewboard.mozilla.org/r/191906/#review197082
Attachment #8920947 - Flags: review?(manishearth) → review+

Comment 23

2 years ago
Pushed by xquan@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2dce277776e9
Update rand to 0.3.17. r=manishearth
Tracking 57+ based on Comment 11.
https://hg.mozilla.org/mozilla-central/rev/2dce277776e9
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Do you think we should uplift it to 57? If so, could you create the uplift request? I'm not sure about all the details with this bug.
Flags: needinfo?(manishearth)
Assignee

Updated

2 years ago
Attachment #8920726 - Attachment is obsolete: true
Flags: needinfo?(manishearth)
Comment on attachment 8920947 [details]
Bug 1409444 - Update rand to 0.3.17.

Approval Request Comment
[Feature/Bug causing the regression]: Stylo
[User impact if declined]: Crashes on Windows systems with broken crypto DLLs
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: We haven't seen any crashes yet
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: No
[Why is the change risky/not risky?]: Updates some Rust libraries.
[String changes made/needed]: No
Attachment #8920947 - Flags: approval-mozilla-beta?
Comment on attachment 8920947 [details]
Bug 1409444 - Update rand to 0.3.17.

This crash has ~500 instances in a week, let's take the fix, beta57+
Attachment #8920947 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
This needs a rebased patch for Beta uplift.
Flags: needinfo?(manishearth)
Hold on. We may also need to address bug 1411250 first.
(In reply to Xidorn Quan [:xidorn] UTC+10 (less responsive Nov 5 ~ Dec 16) from comment #30)
> Hold on. We may also need to address bug 1411250 first.

This uplift will hopefully fix content process top crash #5 that Windows Beta 57 users are hitting. I don't think this uplift needs to wait for bug 1411250 to fix a Linux test failure. We can uplift bug 1411250 when its fix is available.
This fix has already missed a beta now because of the conflicts (and there's only two left this cycle). Can we please move forward with a rebased patch?
Flags: needinfo?(xidorn+moz)
Flags: needinfo?(xidorn+moz)
Flags: needinfo?(manishearth)
I guess I can just land it directly.
Target Milestone: mozilla58 → mozilla57
FYI, TM tracks landing on m-c.
Target Milestone: mozilla57 → mozilla58
You need to log in before you can comment on or make changes to this bug.