Closed Bug 1385971 Opened 7 years ago Closed 7 years ago

stylo: Crash in core::result::unwrap_failed<T> | std::collections::hash::map::{{impl}}::new::KEYS::__init | style::properties::cascade

Categories

(Core :: CSS Parsing and Computation, defect, P2)

56 Branch
Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 --- unaffected
firefox56 --- wontfix
firefox57 --- fixed
firefox58 --- fixed

People

(Reporter: calixte, Assigned: manishearth, NeedInfo)

References

Details

(Keywords: crash)

Crash Data

Attachments

(3 files)

This bug was filed from the Socorro interface and is 
report bp-2a72f9aa-bb7c-42f5-ac45-0f7880170729.
=============================================================

There is 1 crash in nightly 56 with buildid 20170729100254.
The backtrace is showing that 'style::properties::cascade' is calling std::collections::hash::map.
By the way this crash signature exists for non-stylo crashes.
Priority: -- → P2
Priority: P2 → --
Priority: -- → P3
FWIW, bug 1385976 has a crash report with the same signature starting in mp4parse_new, where I don't see any way the HashMap init machinery would be called, so I fear this is unrelated corruption.
(In reply to Ralph Giles (:rillian) | needinfo me from comment #1)
> FWIW, bug 1385976 has a crash report with the same signature starting in
> mp4parse_new, where I don't see any way the HashMap init machinery would be
> called, so I fear this is unrelated corruption.

Ok.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
Reopening this bug and marking this bug as Stylo P1 because this is top crash #10 in Beta 56. I see 88 instances of this crash signature in Beta 56.0b1, but none from Nightly 57.0a1.

https://crash-stats.mozilla.com/search/?signature=%3Dcore%3A%3Aresult%3A%3Aunwrap_failed%3CT%3E%20%7C%20std%3A%3Acollections%3A%3Ahash%3A%3Amap%3A%3A%7B%7Bimpl%7D%7D%3A%3Anew%3A%3AKEYS%3A%3A__init%20%7C%20style%3A%3Aproperties%3A%3Acascade&product=Firefox&date=%3E%3D2017-07-11T08%3A00%3A28.000Z&date=%3C2017-08-11T08%3A00%3A28.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Status: RESOLVED → REOPENED
Priority: P3 → P1
Resolution: WORKSFORME → ---
Summary: Stylo: Crash in core::result::unwrap_failed<T> | std::collections::hash::map::{{impl}}::new::KEYS::__init → stylo: Crash in core::result::unwrap_failed<T> | std::collections::hash::map::{{impl}}::new::KEYS::__init | style::properties::cascade
(In reply to Chris Peterson [:cpeterson] from comment #3)
> Reopening this bug and marking this bug as Stylo P1 because this is top
> crash #10 in Beta 56. I see 88 instances of this crash signature in Beta
> 56.0b1, but none from Nightly 57.0a1.

Is the stylo experiment running in beta 56?
Flags: needinfo?(cpeterson)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #4)
> Is the stylo experiment running in beta 56?

No. We haven't started the beta experiment yet. I see now that all 88 crash reports are from a single Windows 7 user. They must have manually set the layout.css.servo.enabled perf.

Many of their crash URLs look like JIRA bug dashboards. I'll ask around to see if anyone has access to a JIRA system.
Crash Signature: [@ core::result::unwrap_failed<T> | std::collections::hash::map::{{impl}}::new::KEYS::__init] → [@ core::result::unwrap_failed<T> | std::collections::hash::map::{{impl}}::new::KEYS::__init [@ core::result::unwrap_failed<T> | std::collections::hash::map::{{impl}}::new::KEYS::__init | style::properties::cascade ]
Flags: needinfo?(cpeterson)
Priority: P1 → P3
Doesn't seem to happen after build 20170813100233. Closing as WFM.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → WORKSFORME
I still see a couple crash reports from more recent Nightly 57 builds, such as bp-110ace36-65b4-469b-902b-bffe10170830 from build id 20170829100404. But not enough instances to warrant reopening this bug at this time.
For information, there are 102 crashes for 5 installations in 57.0b0 (dev edition).
The 57.0b crashes are from build ids 20170917031738 and 20170921191414 on Windows 7. There are no crashes reported in Nightly 58 yet.
Crash Signature: [@ core::result::unwrap_failed<T> | std::collections::hash::map::{{impl}}::new::KEYS::__init [@ core::result::unwrap_failed<T> | std::collections::hash::map::{{impl}}::new::KEYS::__init | style::properties::cascade ] → [@ core::result::unwrap_failed<T> | std::collections::hash::map::{{impl}}::new::KEYS::__init] [@ core::result::unwrap_failed<T> | std::collections::hash::map::{{impl}}::new::KEYS::__init | style::properties::cascade ]
These new reports all seem to have MOZ_CRASH_REASON:  to create an OS RNG: Error { repr: Os { code: 127, message: "OS Error 127 (FormatMessageW() returned error 15100)" } }
Status: RESOLVED → REOPENED
Flags: needinfo?(manishearth)
Resolution: WORKSFORME → ---
If we're failing to open an RNG filehandle or something, we should consider failing gracefully somehow and just not reseeding rather than bringing down the whole process.
Ok, so what's happening here is that the OS RNG returned some error (we don't know what). This triggered a panic, and in attempting to get more error information from the OS we get ERROR_MUI_FILE_NOT_FOUND, which means that the language resources (l10n files likely) for the system language are somehow broken, which also triggered a panic. This is not a double-panic situation since the second panic is triggered before the first panic actually starts, in the Display implementation of the error (which is called in attempting to create a string to pass to the panic).


I don't know how to fix this, but we can fix up the Windows error handling stuff in Rust to display the original error code when this double-error situation happens. Sound good?
Yeah, we're failing to somehow init an OS RNG instance.


The stdlib doesn't expose an error state for this, and hashglobe doesn't actually fork RandomState (we reuse the stdlib one).
Flags: needinfo?(manishearth)
Hm, wait, error code 127 is ERROR_PROC_NOT_FOUND.

That's ... weird.
unique crash reasons from the reports:

failed to create an OS RNG: Error { repr: Os { code: 127, message: "The specified procedure could not be found." } } 
failed to create an OS RNG: Error { repr: Os { code: 127, message: "OS Error 127 (FormatMessageW() returned error 15100)" } } 
failed to create an OS RNG: Error { repr: Os { code: -2146893818, message: "OS Error -2146893818 (FormatMessageW() returned error 15100)" } } 
failed to create an OS RNG: Error { repr: Os { code: -2146893818, message: "Firma no válida." } } 
failed to create an OS RNG: Error { repr: Os { code: -2146893801, message: "Provider type not defined." } } 
failed to create an OS RNG: Error { repr: Os { code: -2146893795, message: "Библиотека поставщика проинициализирована неправильно." } }
NI Manish to write up his analysis here per discussion in the meeting.
Flags: needinfo?(manishearth)
So my analysis of error 127 was basically "it shouldn't happen", that error occurs when you request a nonexistant API.

Similarly, the other error has to do with the provider not existing. However in OsRng we *always* pass down PROV_RSA_FULL and that's a preexisting provider (not user-defined) so it should always exist.

Basically this feels a lot like the DLL is corrupted somehow.

The Go folks had similar troubles and also chalked it up to DLL corruption on the target system (https://github.com/golang/go/issues/15655).

What's interesting is that Firefox seems to not have had such crashes before? (or maybe it has, I can't find anything on socorro) Perhaps we don't report the error as clearly (or we ignore it? looking into this) 


Since it's very likely a corrupted DLL, I'd probably wontfix this unless it gets more frequent. The fix involves forking RandomState as well and making all of its APIs fallible too, which won't be nice.
Flags: needinfo?(manishearth)
We've had periodic problems with Firefox startup crashes in calls to the Windows API rand_s() because bad third-party software injected DLL hooks into advapi32.dll (e.g. bug 694344, bug 1167248, bug 1369361). OsRng calls CryptGenRandom() in advapi32.dll, so maybe we're hitting a similar problem here.

I looked for suspicious third-party DLLs in a few of these crash reports' module lists, but I didn't see any DLLs in common that stood out. If we do find a problem DLL, we can add it to our DLL blocklist.
Now that Beta 57.0b3 is live, we're seeing more of these __init crashes. There are 477 crashes from 17 installations of 57.0b3.

I'm bumping this bug's priority from P3 to P2 because it is now Beta top crash #17.
Priority: P3 → P2
Could we add a separate codepath to fallibly try to create an OS RNG, and if that fails, set a flag on hashglobe to skip the random seed?
Flags: needinfo?(manishearth)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #20)
> Could we add a separate codepath to fallibly try to create an OS RNG, and if
> that fails, set a flag on hashglobe to skip the random seed?

(To be clear, I'm talking about doing this at startup)
We could, yes. A global atomic flag I presume?

I'm also going to file a Rust bug about this, this seems pretty bad.
Flags: needinfo?(manishearth)
(In reply to Manish Goregaokar [:manishearth] from comment #22)
> We could, yes. A global atomic flag I presume?

Yep. If we load/store it Relaxed it should be free.

> I'm also going to file a Rust bug about this, this seems pretty bad.

Yeah, definitely worth doing.
I have interviews tomorrow, but will try to get to this as soon as i can.
Flags: needinfo?(manishearth)
Assignee: nobody → manishearth
FWIW I tried the approach of forking RandomState first because if it's actually easy I'd prefer to do that over using global statics.

It's not easy.

The problem is that while it's easy to fork RandomState and have it use hardcoded "random"[1] values when OsRng fails, the BuildHasher implementation on RandomState can't be forked, because it needs the ability to initialize a SipHasher13 with its random state. SipHasher13 is also an unstable/private API. We could fork SipHasher13 and DefaultHasher as well, but that's a lot of forking. Alternatively we could fork DefaultHasher only and switch it over to using SipHasher24, which for some reason does have a stable API. I believe both hashing algorithms are fine for this purpose but one may be slower than the other; and in this case the penalty would be borne by all.

So I'll stick to using the OsRng.











 [1]: https://xkcd.com/221/
Flags: needinfo?(manishearth)
> So I'll stick to using the OsRng.

er, stick to using the "trigger OsRng in Servo_Initialize() with a once-init static" solution
Really this patch should remove the new() functions entirely so when we unfork HashMap we'll still use FNV. I'll make this change after emilio's patch for fixing the property hashset lands.
Comment on attachment 8913848 [details]
Bug 1385971 - stylo: Use FnvHashMap everywhere, remove default HashMap construction methods;

https://reviewboard.mozilla.org/r/185240/#review190244

Per IRC discussion, we're just going to remove the default here.
Attachment #8913848 - Flags: review?(bobbyholley) → review-
See Also: → 1404647
Note that even though this hasn't landed, https://github.com/servo/servo/pull/18679 should have removed the offending code.
Comment on attachment 8913848 [details]
Bug 1385971 - stylo: Use FnvHashMap everywhere, remove default HashMap construction methods;

https://reviewboard.mozilla.org/r/185240/#review190670
Attachment #8913848 - Flags: review?(bobbyholley) → review+
https://github.com/servo/servo/pull/18712

We'll want to uplift both, all of these seem to be in paths that will eventually get called; the cascade() one just gets called first.
Approval Request Comment
[Feature/Bug causing the regression]: None
[User impact if declined]: May crash on windows sometimes
[Is this code covered by automated tests?]: No
[Has the fix been verified in Nightly?]: No (but the crashing code is gone)
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: The other patch attached here
[Is the change risky?]: May cause minor changes in performance, probably positive
[Why is the change risky/not risky?]:
[String changes made/needed]: None
Attachment #8915011 - Flags: approval-mozilla-beta?
Attachment #8915012 - Flags: approval-mozilla-beta?
Can we mark this bug as fixed now given the comment 35?
Flags: needinfo?(manishearth)
https://hg.mozilla.org/mozilla-central/rev/be0e197531f2

yep
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Flags: needinfo?(manishearth)
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
(In reply to Manish Goregaokar [:manishearth] from comment #35)
> Emilio's patch: https://hg.mozilla.org/mozilla-central/rev/7e790d5276a1

Note to relman - I really want the above in the next beta, since it may help us diagnose some other unrelated crashes.
Flags: needinfo?(rkothari)
Comment on attachment 8915011 [details] [diff] [review]
Emilio's patch (for uplift)

Crash fix in Stylo code, Beta57+
Flags: needinfo?(rkothari)
Attachment #8915011 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8915012 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Hi Stuart, Kevin, Joel (from the monitoring/perf team), there is a potential perf impact (mostly positive). Just FYI in case you notice something on the talos tests.
Flags: needinfo?(sphilp)
Flags: needinfo?(kbrosnan)
Flags: needinfo?(jmaher)
The uplift patches attached don't apply cleanly to Beta. Please rebase.
Flags: needinfo?(manishearth)
I'll do the uplift.
Flags: needinfo?(manishearth)
thanks- we will keep an eye out for this.
Flags: needinfo?(jmaher)
Flags: needinfo?(kbrosnan)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: