Closed Bug 1409444 Opened 7 years ago Closed 7 years ago

stylo: Crash in rand::weak_rng

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla58

Tracking Flags:

Tracking

Status

firefox-esr52

---

unaffected

firefox56

---

unaffected

firefox57

verified

firefox58

---

verified

People

(Reporter: philipp, Assigned: manishearth)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(2 files, 1 obsolete file)

Bug 1409444 - stylo: Update rand to 0.3.17; 7 years ago Manish Goregaokar [:manishearth] 59 bytes, text/x-review-board-request		Details
Bug 1409444 - Update rand to 0.3.17. 7 years ago Xidorn Quan [:xidorn] UTC+11 59 bytes, text/x-review-board-request	manishearth : review+ ritu : approval-mozilla-beta+	Details
patch for beta 7 years ago Xidorn Quan [:xidorn] UTC+11 226.03 KB, patch		Details \| Diff \| Splinter Review

[:philipp]

Reporter

Description

•

7 years ago

This bug was filed from the Socorro interface and is 
report bp-ed6d1e84-c4e2-402e-b778-05a390171017.
=============================================================
Crashing Thread (29)
Frame 	Module 	Signature 	Source
0 	xul.dll 	std::panicking::rust_panic_with_hook 	src/libstd/panicking.rs:555
1 	xul.dll 	std::panicking::begin_panic<collections::string::String> 	src/libstd/panicking.rs:511
2 	xul.dll 	std::panicking::begin_panic_fmt 	src/libstd/panicking.rs:495
3 	xul.dll 	rand::weak_rng 	third_party/rust/rand/src/lib.rs:856
4 	xul.dll 	std::sys_common::backtrace::__rust_begin_short_backtrace<closure, ()> 	src/libstd/sys_common/backtrace.rs:133
5 	xul.dll 	alloc::boxed::{{impl}}::call_box<(), closure> 	src/liballoc/boxed.rs:647
6 	xul.dll 	std::sys::imp::thread::{{impl}}::new::thread_start 	src/libstd/sys/windows/thread.rs:50
7 	kernel32.dll 	BaseThreadInitThunk 	
8 	ntdll.dll 	RtlUserThreadStart

this rust panic signature in content processes is regressing in larger numbers since 57.0b6 and 58.0a1 build 20171006220306.
bug fixes that went into b6 were: https://mzl.la/2fPo1by

the crashes seem to hit particular installations repeatedly and come with various moz crash reasons:
https://crash-stats.mozilla.com/search/?signature=%3Drand%3A%3Aweak_rng&date=%3E%3D2017-10-01T00%3A00%3A00.000Z&_facets=moz_crash_reason#facet-moz_crash_reason

Chris Peterson [:cpeterson]

Comment 1

•

7 years ago

Manish, given your experience debugging the hashmap RNG failures, any thoughts on these weak_rng() panics after OsRng::new() fails? The errors affect all Windows versions (plus a few OS X 10.12.x users), but 94% of the crash reports are from Windows 7.

Perhaps the Rust rand library should work around these RNG failures in the Windows implementation of OsRng::new(). That would fix the whole Rust ecosystem and we wouldn't need to play Whac-A-Mole with RNG panics in Firefox. :)

Blocks: stylo-crash-reports

Crash Signature: [@ rand::weak_rng] → [@ mozalloc_abort | abort | rand::weak_rng] [@ rand::weak_rng]

Flags: needinfo?(manishearth)

Priority: -- → P3

Manish Goregaokar [:manishearth]

Assignee

Comment 2

•

7 years ago

This is the same problem, OsRng sometimes gets funky failures from Windows, as far as I can tell due to broken DLLs.

We fixed the last one by removing uses of rng from hashmaps. This can't be fixed the same way; this is part of the rust runtime (and the fix would need to be in the stdlib's copy of the rand crate)

Flags: needinfo?(manishearth)

Chris Peterson [:cpeterson]

Comment 3

•

7 years ago

(In reply to Manish Goregaokar [:manishearth] from comment #2)
> We fixed the last one by removing uses of rng from hashmaps. This can't be
> fixed the same way; this is part of the rust runtime (and the fix would need
> to be in the stdlib's copy of the rand crate)

Yes, I'm suggesting we fix this upstream in Rust's stdlib.

Manish Goregaokar [:manishearth]

Assignee

Comment 4

•

7 years ago

I don't see how this can be fixed, though. The OS's RNG returned a weird error code; what *can* we do?

Manish Goregaokar [:manishearth]

Assignee

Comment 5

•

7 years ago

Also, this isn't actually a runtime bug, I realized that the backtrace indicates that we had a weak_rng failure in a thread we spawned.

Looking at callers of weak_rng apparently Rayon threads initialize an RNG

Chris Peterson [:cpeterson]

Comment 6

•

7 years ago

weak_rng could silently fall back to a weaker seed value like the current time if OsRng fails. weak_rng doesn't make any promises about being cryptographically secure.

Chris Peterson [:cpeterson]

Comment 7

•

7 years ago

Alex, as an owner of the "rand" crate, what do you think about changing weak_rng to silently fall back to a weaker seed value (like the current time + whatever) instead of panicking when OsRng::new() returns an error?

Crashing Firefox because weak_rng has a weak seed seems pretty extreme. These crashes currently only affect the Firefox content process, but they will affect the main browser process once we start using Stylo for XUL/chrome documents.

This panic is currently Windows content process top crash #10 in Beta 57. It's only #63 in Nightly 58, probably because fewer Nightly users have broken Windows 7 systems.

Besides Windows' CryptAcquireContext returning weird errors, we've hit OS RNG failures in Firefox for Android trying to open("/dev/urandom") because the OS ran out of fds in automation (bug 1140806).

Flags: needinfo?(acrichton)

Manish Goregaokar [:manishearth]

Assignee

Comment 8

•

7 years ago

I also filed https://github.com/rust-lang-nursery/rand/issues/180

Alex Crichton [:acrichto]

Comment 9

•

7 years ago

Looking over this again. It seems like in any case you're considering upgrading and/or updating the `rand` crate, right? If we were to implement a fix, that is, you'd presumably update. Looking at the crash report you're currently using 0.3.15 of the `rand` crate, and in the meantime we've since published up to 0.3.17 for the `rand` crate.

The updates notably include https://github.com/rust-lang-nursery/rand/issues/111 which switches the implementation of OS randomness on Windows to a private API in Windows itself. On that issue it was also indicated that it's the same strategy Gecko takes I believe?

Basically what I'm thinking is that if you're thinking of updating anyway to fix this, maybe it's worth updating to 0.3.17 to pull in that fix and see what happens? That may end up transitively fixing https://github.com/rust-lang/rust/issues/44911 which would I think solve this issue, right?

Failing that it seems fine to me to fall back to some sort of a weaker seed for a "weak rng", it is, after all, weak!

Flags: needinfo?(acrichton)

Chris Peterson [:cpeterson]

Comment 10

•

7 years ago

Manish, can you try updating Nightly 58 from `rand` crate version 0.3.15 to 0.3.17 to see if https://github.com/rust-lang-nursery/rand/issues/111 avoids this crash?

This crash signature is content process top crash #8 in Beta 57, which is pretty high. It would be good to address this somehow, though I doubt we would want to uplift the `rand` 0.3.17 crate this late in Beta 57.

For some reason, this signature jumped in Beta 57 57.0b6 (2017-10-03) and continues to get worse:

57.0b3 = 3 reports
57.0b4 = 1
57.0b5 = 1
57.0b6 = 117
57.0b7 = 188
57.0b8 = 296
57.0b9 = 369

The pushlog between 57.0b5 and 57.0b6 includes your FnvHashMap fix (bug 1385971). Maybe we just swapped the hash::map::{{impl}}::new::KEYS::__init crashes for weak_rng for these sad Windows 7 users?

https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_57_0b5_RELEASE&tochange=FIREFOX_57_0b6_RELEASE

Flags: needinfo?(manishearth)

Chris Peterson [:cpeterson]

Comment 11

•

7 years ago

[Tracking Requested - why for this release]:

This Stylo crash signature is content process top crash #8 in Beta 57.

This crash spiked in 57.0b6 after bug 1385971 fixed a similar crash (thus allowing those crashing users to hit this new crash instead).

tracking-firefox57: --- → ?

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/191740/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/191740/

Manish Goregaokar [:manishearth]

Assignee

Comment 13

•

7 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d6ec2d9004c9559a1040247aa893340b892d15e4

Flags: needinfo?(manishearth)

Xidorn Quan [:xidorn] UTC+11

Comment 14

•

7 years ago

Lots of .cargo-ok changes... Try using cargo-vendor 0.1.11? Or we should make some progress on bug 1403407...

Emilio Cobos Álvarez (:emilio)

Comment 15

•

7 years ago

Wait, aren't this crashes inside rust stdlib support? How does updating the crate help?

Emilio Cobos Álvarez (:emilio)

Comment 16

•

7 years ago

(In reply to Emilio Cobos Álvarez [:emilio] from comment #15)
> Wait, aren't this crashes inside rust stdlib support? How does updating the
> crate help?

err, s/support/code/, but my question still stands, this updates the crate in the rust libs, but not in the stdlib, right? (Though it may be the case that I'm missing something).

Manish Goregaokar [:manishearth]

Assignee

Comment 17

•

7 years ago

They're not. I thought they were.

They're in the Rayon thread initialization code. I looked at the stacks initially and thought they were in the Rust runtime but on looking closer the rust runtime does not rely on weak_rng.

Manish Goregaokar [:manishearth]

Assignee

Updated

•

7 years ago

Assignee: nobody → manishearth

Xidorn Quan [:xidorn] UTC+11

Comment 18

•

7 years ago

mozreview-review

Comment on attachment 8920726 [details]
Bug 1409444 - stylo: Update rand to 0.3.17;

https://reviewboard.mozilla.org/r/191740/#review197030

This seems to be updating lots of unrelated things... Please just update rand (and maybe its dependencies).

Attachment #8920726 - Flags: review?(xidorn+moz)

Manish Goregaokar [:manishearth]

Assignee

Comment 19

•

7 years ago

That is what I did. cargo update -p does the minimum possible update.

Comment hidden (mozreview-request)

Review commit: https://reviewboard.mozilla.org/r/191906/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/191906/

Xidorn Quan [:xidorn] UTC+11

Comment 21

•

7 years ago

This should be the minimum possible update.

Manish Goregaokar [:manishearth]

Assignee

Comment 22

•

7 years ago

mozreview-review

Comment on attachment 8920947 [details]
Bug 1409444 - Update rand to 0.3.17.

https://reviewboard.mozilla.org/r/191906/#review197082

Attachment #8920947 - Flags: review?(manishearth) → review+

Pulsebot

Comment 23

•

7 years ago

Pushed by xquan@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2dce277776e9
Update rand to 0.3.17. r=manishearth

Marcia Knous [:marcia]

Comment 24

•

7 years ago

Tracking 57+ based on Comment 11.

Marcia Knous [:marcia]

Updated

•

7 years ago

tracking-firefox57: ? → +

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 25

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/2dce277776e9

Status: NEW → RESOLVED

Closed: 7 years ago

status-firefox58: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla58

Xidorn Quan [:xidorn] UTC+11

Comment 26

•

7 years ago

Do you think we should uplift it to 57? If so, could you create the uplift request? I'm not sure about all the details with this bug.

Flags: needinfo?(manishearth)

Manish Goregaokar [:manishearth]

Assignee

Updated

•

7 years ago

Attachment #8920726 - Attachment is obsolete: true

Flags: needinfo?(manishearth)

Manish Goregaokar [:manishearth]

Assignee

Comment 27

•

7 years ago

Comment on attachment 8920947 [details]
Bug 1409444 - Update rand to 0.3.17.

Approval Request Comment
[Feature/Bug causing the regression]: Stylo
[User impact if declined]: Crashes on Windows systems with broken crypto DLLs
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: We haven't seen any crashes yet
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: No
[Why is the change risky/not risky?]: Updates some Rust libraries.
[String changes made/needed]: No

Attachment #8920947 - Flags: approval-mozilla-beta?

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 28

•

7 years ago

Comment on attachment 8920947 [details]
Bug 1409444 - Update rand to 0.3.17.

This crash has ~500 instances in a week, let's take the fix, beta57+

Attachment #8920947 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Ryan VanderMeulen [:RyanVM]

Comment 29

•

7 years ago

This needs a rebased patch for Beta uplift.

Flags: needinfo?(manishearth)

Xidorn Quan [:xidorn] UTC+11

Comment 30

•

7 years ago

Hold on. We may also need to address bug 1411250 first.

Chris Peterson [:cpeterson]

Comment 31

•

7 years ago

(In reply to Xidorn Quan [:xidorn] UTC+10 (less responsive Nov 5 ~ Dec 16) from comment #30)
> Hold on. We may also need to address bug 1411250 first.

This uplift will hopefully fix content process top crash #5 that Windows Beta 57 users are hitting. I don't think this uplift needs to wait for bug 1411250 to fix a Linux test failure. We can uplift bug 1411250 when its fix is available.

Ryan VanderMeulen [:RyanVM]

Comment 32

•

7 years ago

This fix has already missed a beta now because of the conflicts (and there's only two left this cycle). Can we please move forward with a rebased patch?

Flags: needinfo?(xidorn+moz)

Xidorn Quan [:xidorn] UTC+11

Comment 33

•

7 years ago

Attached patch patch for beta — Details — Splinter Review

Flags: needinfo?(xidorn+moz)

Flags: needinfo?(manishearth)

Xidorn Quan [:xidorn] UTC+11

Comment 34

•

7 years ago

I guess I can just land it directly.

Xidorn Quan [:xidorn] UTC+11

Comment 35

•

7 years ago

https://hg.mozilla.org/releases/mozilla-beta/rev/196059cada70

Xidorn Quan [:xidorn] UTC+11

Updated

•

7 years ago

status-firefox57: affected → fixed

Target Milestone: mozilla58 → mozilla57

Ryan VanderMeulen [:RyanVM]

Comment 36

•

7 years ago

FYI, TM tracks landing on m-c.

Target Milestone: mozilla57 → mozilla58

Chris Peterson [:cpeterson]

Comment 37

•

7 years ago

I think this speculative fix is working because we haven't seen any weak_rng crash reports in Nightly 58.0a1 since we landed this fix on October 23 (build id 2017, though the crash volume had been very low on Nightly:

https://crash-stats.mozilla.com/search/?signature=~weak_rng&product=Firefox&version=58.0a1&platform=Windows&date=%3E%3D2017-04-30T15%3A51%3A19.000Z&_sort=-date&_facets=signature&_facets=version&_facets=platform_pretty_version&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-build_id

This fix will ship in Beta 57.0b13 today, so we'll see if the Beta crash reports stop after 57.0b12:

https://crash-stats.mozilla.com/search/?signature=~weak_rng&product=Firefox&version=57.0b&platform=Windows&date=%3E%3D2017-04-30T08%3A51%3A19.000Z&_sort=-date&_facets=signature&_facets=version&_facets=platform_pretty_version&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version

Chris Peterson [:cpeterson]

Comment 38

•

7 years ago

There have been no weak_rng crash reports in 56.0b13, so we can assume we've fixed the crash!

https://crash-stats.mozilla.com/search/?signature=~weak_rng&product=Firefox&version=57.0b&platform=Windows&date=%3E%3D2017-10-23T10%3A27%3A29.000Z&date=%3C2017-11-06T09%3A27%3A29.000Z&_sort=-date&_facets=signature&_facets=version&_facets=platform_pretty_version&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version

Status: RESOLVED → VERIFIED

status-firefox57: fixed → verified

status-firefox58: fixed → verified

Chris Peterson [:cpeterson]

Updated

•

5 years ago