Closed Bug 1456150 Opened 3 years ago Closed 3 years ago

Crash in static void alloc_system::platform::{{impl}}::oom

Categories

(Core :: General, defect)

Platform: Unspecified / Windows 10
Priority: Not set
Severity: critical

Tracking


RESOLVED FIXED
mozilla63
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox60 --- unaffected
firefox61 + fixed
firefox62 + fixed
firefox63 + fixed

People

(Reporter: marcia, Assigned: glandium)

References

Details

(Keywords: crash)

Crash Data

This bug was filed from the Socorro interface and is
report bp-ee778cd8-02e2-4c23-97e0-e5d270180331.
=============================================================

Seen while looking at nightly crash stats - this one goes back to the build from 20180331045244: https://bit.ly/2HEHhIa. All of the crashes have crash reason "EXCEPTION_ILLEGAL_INSTRUCTION".

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b7fa9d95150ec24e3c8d11106153f76be4507356&tochange=3f9a70b125f6fb9be2b209159657fd7ae5515c01

Top 4 frames of crashing thread:

0 xul.dll static void alloc_system::platform::{{impl}}::oom src/liballoc_system/lib.rs:449
1 xul.dll static void alloc_system::{{impl}}::oom src/liballoc_system/lib.rs:81
2 xul.dll static void std::heap::__default_lib_allocator::__rdl_oom src/libstd/heap.rs:49
3 xul.dll xul.dll@0x5b0df8 

=============================================================
It looks like all of the Rust OOM signatures are getting bucketed together.

The first problem I see is that Rust stack walking isn't getting past the first three frames. Maybe this is a regression from "Bug 1447116 - Update rust builders to 1.25"?

The second problem is that the crash stats signature should go past the first frame. I'm not sure what Rust OOM signatures looked like before so I can't tell if that's a regression or not. Ideally, we'd have one frame that would indicate that it is an OOM, and then a frame from the place that is actually causing the OOM.

The third issue then is that we're hitting an OOM in some places. I don't know if these are, separately, enough of an issue to warrant attention. Looking through the proto signatures for reports that actually have useful information after the junk xul.dll frame, I see some crashes that are OOMing in Servo media query list code like this:
  bp-d3ddd86f-76ea-4f5f-a214-f15880180421

For some of the signatures where there are no extra frames, a lot of them seem to be on the MediaPlayback thread, so I assume that the Rust media parser thing is involved somehow.
Nathan, do you know who would look into Rust stack traces not working right? Possibly as a regression from a Rust upgrade. Thanks.
Blocks: 1447116
Flags: needinfo?(nfroyd)
Unfortunately the Rust code is not nice enough to tell us what the OOM amount might be...but the crash dump says our smallest contiguous VM block is < 2MB, so that's an issue no matter what size we're trying to allocate.

Ted is the person who has handled Rust stack related things before, but perhaps he has somebody else to suggest?

Rust 1.25 has caused some other, performance-related issues; perhaps we should revert to 1.24 for this cycle while we sort out the Rust 1.25 problems?  Chris, WDYT?
Flags: needinfo?(ted)
Flags: needinfo?(nfroyd)
Flags: needinfo?(cmanchester)
Can we confirm this is a rust 1.25 regression? I guess it seems likely given the regression range.

I'm trying to determine in bug 1451703 if staying on 1.24 as a result of that regression would be useful. If so we can back out and see if that helps.
Flags: needinfo?(cmanchester)
Sure, I can dig around in crash stats and try to figure out what OOM crashes looked like before.
* First question: what do Rust OOMs look like? For this, I looked at crashes where the protosignature contains _oom, but the signature does not contain mozalloc_abort (to try to filter out C++ crashes), from the past 3 months, across all branches.

These are the top three:
std::heap::__default_lib_allocator::__rdl_oom (913 crashes)
static void alloc_system::platform::{{impl}}::oom (104 crashes)
__rust_oom (49 crashes)

There are also a smattering of less-common signatures, such as:
  OOM | large | std::heap::__default_lib_allocator::__rdl_oom (from a recent Firefox)
  OOM | unknown | alloc::oom::default_oom_handler | alloc::oom::oom | mp4parse::skip<T>  (from Firefox 54)
So it looks like sometimes Rust crashes are correctly identified as OOM. I'll try to find out where the code is for that.

* Second question: what do the most common Rust OOMs look like on Nightly? For this, I looked at crashes where the protosignature contains any of the top three OOM stack frames I identified above, but only for Nightly 61. I looked at the period from March 10, slightly before the first Nightly 61, to April 18. That's about three weeks before and two and a half weeks after March 31, when the compiler change landed. There are exactly 7 such crashes before the 3-31 build, and 88 after.

So, it looks like OOM crashes went up massively on the 3-31 build. I don't know if that is due to the Rust upgrade, or something else. (Maybe a WebRender or Stylo change landed in the same Nightly.)

* What about the xul.dll frames? None of the 7 crashes before the 3-31 build (for example: bp-2369c432-2382-4384-bd06-cb37b0180320 and bp-bda1d6da-38ca-4ffa-8759-b884c0180322) have the alloc_system::platform::{{impl}}::oom signature, and their stacks look good.

* alloc_system::platform::{{impl}}::oom in particular: I don't see any crashes on any branch prior to the 3-31 Nightly build that have the alloc_system::platform::{{impl}}::oom signature. All of the alloc_system::platform::{{impl}}::oom signatures seem to contain xul.dll. About half of these crashes contain more frames, but half do not.

TLDR: Rust OOM crashes went up massively on the Nightly 3-31 build (though these are still not very common crashes overall), and this alloc_system::platform::{{impl}}::oom signature also appeared there, and has an unsymbolized frame (and, half the time, no more frames), whereas the handful of older Nightly crashes did not have these unsymbolized frames.
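The searches above can also be scripted against Socorro's SuperSearch API. This is a sketch that only builds the query URL; the `~` (contains) and `!~` (does not contain) operator prefixes come from the SuperSearch documentation, and the field names are the ones used in this comment, so treat the exact parameters as assumptions to verify against the API docs:

```rust
// Build a Socorro SuperSearch query URL like the searches described above:
// proto_signature contains `contains`, and signature must not contain any
// of `excludes`.
fn encode(s: &str) -> String {
    // Minimal percent-encoding, sufficient for the characters used here.
    s.bytes()
        .map(|b| match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                (b as char).to_string()
            }
            _ => format!("%{:02X}", b),
        })
        .collect()
}

fn oom_query(contains: &str, excludes: &[&str]) -> String {
    let mut params = vec![
        ("proto_signature".to_string(), format!("~{}", contains)),
        ("_facets".to_string(), "signature".to_string()),
        // Facets only; we don't need the individual crash reports.
        ("_results_number".to_string(), "0".to_string()),
    ];
    for e in excludes {
        // Repeated negative filters on the same field are ANDed together.
        params.push(("signature".to_string(), format!("!~{}", e)));
    }
    let query: Vec<String> = params
        .iter()
        .map(|(k, v)| format!("{}={}", encode(k), encode(v)))
        .collect();
    format!(
        "https://crash-stats.mozilla.org/api/SuperSearch/?{}",
        query.join("&")
    )
}

fn main() {
    println!("{}", oom_query("_oom", &["mozalloc_abort"]));
}
```

Fetching the URL and faceting on signature then gives counts like the ones quoted above.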
(the lack of stack traces)
So re: the OOM signatures being bucketed together, we already have a bunch of Rust std library frames in the prefix list:
https://github.com/mozilla-services/socorro/blob/master/socorro/siglists/prefix_signature_re.txt

Looking at the blame, bug 1373272 and bug 1389474 added those, but it looks like either the function names changed in the stdlib or the formatting of the names changed.
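If the function names did change, fixing the bucketing would mean adding patterns for the new names to that file. The entries appear to be one regular expression per line; these hypothetical additions follow that shape, but the exact escaping conventions are an assumption and should match the file's existing style:

```text
alloc_system::platform::\{\{impl\}\}::oom
alloc_system::\{\{impl\}\}::oom
std::heap::__default_lib_allocator::__rdl_oom
```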
I loaded the minidump from the crash in comment 0 in WinDbg and it was able to unwind farther:
 # Child-SP          RetAddr           Call Site
00 00000054`d23fb4c0 00007ffa`db865442 xul!alloc_system::platform::{{impl}}::oom(void)+0x5c [C:\projects\rust\src\liballoc_system\lib.rs @ 449]
01 00000054`d23fb540 00007ffa`db864b42 xul!alloc_system::{{impl}}::oom(void)+0x22 [C:\projects\rust\src\liballoc_system\lib.rs @ 81]
02 00000054`d23fb590 00007ffa`db870df9 xul!std::heap::__default_lib_allocator::__rdl_oom(void)+0x22 [C:\projects\rust\src\libstd\heap.rs @ 49]
03 00000054`d23fb5e0 00000000`000005af xul!core::fmt::{{impl}}::fmt+0x39
04 00000054`d23fb5e8 00000054`d23fb700 0x5af
05 00000054`d23fb5f0 00000000`00000001 0x00000054`d23fb700
06 00000054`d23fb5f8 00000054`d23fb700 0x1
07 00000054`d23fb600 0000021d`45068240 0x00000054`d23fb700
08 00000054`d23fb608 00007ffa`db358a9f 0x0000021d`45068240
09 00000054`d23fb610 00007ffa`db5ed60d xul!alloc::heap::{{impl}}::oom(void)+0xf [C:\projects\rust\src\liballoc\heap.rs @ 98]
0a (Inline Function) --------`-------- xul!smallvec::SmallVec<[style::gecko::wrapper::GeckoElement; 128]>::push+0x82 [z:\build\build\src\third_party\rust\smallvec\lib.rs @ 401]
0b 00000054`d23fb640 00007ffa`db5ec8eb xul!style::dom_apis::{{impl}}::append_element<style::gecko::wrapper::GeckoElement>(struct smallvec::SmallVec<[style::gecko::wrapper::GeckoElement; 128]> * output = 0x00000054`d23fb700, struct style::gecko::wrapper::GeckoElement element = struct style::gecko::wrapper::GeckoElement)+0x9d [z:\build\build\src\servo\components\style\dom_apis.rs @ 97]
0c (Inline Function) --------`-------- xul!style::dom_apis::collect_all_elements+0xb [z:\build\build\src\servo\components\style\dom_apis.rs @ 219]
0d (Inline Function) --------`-------- xul!style::dom_apis::query_selector_single_query+0x199 [z:\build\build\src\servo\components\style\dom_apis.rs @ 332]
0e (Inline Function) --------`-------- xul!style::dom_apis::query_selector_fast+0x22c [z:\build\build\src\servo\components\style\dom_apis.rs @ 392]
0f (Inline Function) --------`-------- xul!style::dom_apis::query_selector+0x335 [z:\build\build\src\servo\components\style\dom_apis.rs @ 551]
10 00000054`d23fb6d0 00007ffa`dbce2cf1 xul!geckoservo::glue::Servo_SelectorList_QueryAll(struct style::gecko_bindings::structs::root::nsINode * node = <Value unavailable error>, struct style::gecko_bindings::structs::root::RawServoSelectorList * selectors = 0x0000021d`482593e0, struct style::gecko_bindings::structs::root::nsSimpleContentList * content_list = 0x0000021d`4ec37af0, bool may_use_invalidation = <Value unavailable error>)+0x36b [z:\build\build\src\servo\ports\geckolib\glue.rs @ 1962]
11 00000054`d23fc3b0 00007ffa`dbce29d7 xul!nsINode::QuerySelectorAll(class nsTSubstring<char16_t> * aSelector = 0x00000054`d23fc470, class mozilla::ErrorResult * aResult = <Value unavailable error>)+0x71 [z:\build\build\src\dom\base\nsinode.cpp @ 2614]
12 00000054`d23fc3f0 00007ffa`db9bff63 xul!mozilla::dom::ElementBinding::querySelectorAll(struct JSContext * cx = 0x0000021d`3c323000, class JS::Handle<JSObject *> obj = class JS::Handle<JSObject *>, class mozilla::dom::Element * self = 0x0000021d`4705c900, class JSJitMethodCallArgs * args = 0x00000054`d23fc570)+0xbf [z:\build\build\src\obj-firefox\dom\bindings\elementbinding.cpp @ 3849]
13 00000054`d23fc550 000003c3`b490ac2b xul!mozilla::dom::GenericBindingMethod(struct JSContext * cx = 0x0000021d`3c323000, unsigned int argc = 1, union JS::Value * vp = <Value unavailable error>)+0x11b [z:\build\build\src\dom\bindings\bindingutils.cpp @ 3038]
14 00000054`d23fc5e0 00000000`00000000 0x000003c3`b490ac2b


There are quite a few junk frames in the middle there, so perhaps something in the Rust code is not generating good unwind info?
Flags: needinfo?(ted)
I don't have time to dig into this further. Perhaps mw can look at whether rustc is generating good unwind info for these frames? We previously dealt with this in bug 1302078, and acrichto landed a fix in rustc to make it always emit unwind info for win64 even when building with `-Cpanic=abort` in https://github.com/rust-lang/rust/pull/40549.
I noticed in https://github.com/rust-lang/rust/pull/50263 that we're not actually emitting the `uwtable` thing for LLVM for *all* functions (a few allocator-related shims were accidentally exempt). I'm not sure if that'll actually fix the issue here.

The OOM handling on nightly is also slightly different than that on stable right now, so if this is purely related to the `uwtable` attribute accidentally being omitted then it may already be fixed on nightly.

Is there a good way to test out a patch here? Or is it mainly a "wait and see" scenario?
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #11)
> I don't have time to dig into this further. Perhaps mw can look at whether
> rustc is generating good unwind info for these frames? We previously dealt
> with this in bug 1302078, and acrichto landed a fix in rustc to make it
> always emit unwind info for win64 even when building with `-Cpanic=abort` in
> https://github.com/rust-lang/rust/pull/40549.

Does breakpad need unwind info, though? It doesn't for DWARF, maybe it's different for pdb?
(In reply to Mike Hommey [:glandium] from comment #13)
> Does breakpad need unwind info, though? It doesn't for DWARF, maybe it's
> different for pdb?

These are Win64 crashes, and I don't think we have any way to reliably unwind those stacks without it. I'm not 100% sure but I would guess that the `uwtable` attribute controls whether we wind up with entries in the unwind tables in the PE.

(In reply to Alex Crichton [:acrichto] from comment #12)
> Is there a good way to test out a patch here? Or is it mainly a "wait and
> see" scenario?

If there's a way to generate a Rust OOM crash for testing purposes I think we could figure something out. It's easy enough to tweak the Rust toolchain version in a try push, you can just change the arguments to repack_rust.py and we'll use that Rust version in the Firefox build:
https://dxr.mozilla.org/mozilla-central/rev/63a0e2f626febb98d87d2543955ab99a653654ff/taskcluster/ci/toolchain/windows.yml#137

It does require an explicit date like `nightly-2018-04-26`.
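For illustration, the kind of change described above might look like this hypothetical excerpt; the task name and surrounding key structure here are assumptions, and the authoritative layout is the windows.yml linked above:

```yaml
# Hypothetical excerpt; see taskcluster/ci/toolchain/windows.yml for the
# real structure. repack_rust.py's channel argument needs an explicit
# nightly date rather than just "nightly".
win64-rust:
    run:
        arguments: ['--channel', 'nightly-2018-04-26']
```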

We could tweak the internals of nsIDebug2::RustPanic to trigger an OOM in the same try push:
https://dxr.mozilla.org/mozilla-central/rev/63a0e2f626febb98d87d2543955ab99a653654ff/toolkit/library/rust/shared/lib.rs#37

Then download and run the resulting build, call `rustPanic()` and run minidump_stackwalk on the minidump.
Ok cool! I'll ping this thread when the support lands in rust-lang/rust and reaches a nightly. I believe you can "trigger an OOM" by basically just calling std::alloc::oom (https://doc.rust-lang.org/nightly/std/alloc/fn.oom.html) on recent nightlies, as that's what collections and such will be doing anyway.
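The allocation-failure path can also be exercised without actually aborting, which is handy for checking the plumbing before wiring an OOM into RustPanic. A minimal sketch, assuming the modern stable `std::alloc` API (on the 2018 nightlies discussed here the equivalent final step would be calling the then-unstable `std::alloc::oom` once the allocation fails):

```rust
use std::alloc::{alloc, dealloc, Layout};

/// Attempt an allocation larger than the address space; returns true if it
/// failed (i.e. `alloc` returned null). In Firefox it is the global OOM
/// handler invoked at this point (`std::alloc::oom` on the nightlies
/// discussed above, `handle_alloc_error` on modern Rust) that aborts the
/// process and produces the crash signatures in this bug.
fn huge_alloc_fails() -> bool {
    let layout = Layout::from_size_align(isize::MAX as usize - 8, 1).unwrap();
    let p = unsafe { alloc(layout) };
    if p.is_null() {
        true
    } else {
        // Shouldn't happen, but don't leak if it somehow succeeds.
        unsafe { dealloc(p, layout) };
        false
    }
}

fn main() {
    assert!(huge_alloc_fails());
    println!("allocation failed as expected");
}
```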
Ok that was a little speedier than I thought it would be! The `nightly-2018-04-28` toolchain for MSVC should have the fix in https://github.com/rust-lang/rust/pull/50263, want to try compiling with that and see if it improves the situation here?
I got a working build out of that, but `std::alloc::oom` now crashes the process in a way that Breakpad cannot handle, so I can't use it to verify this. :-/

Specifically it uses `rtabort!` which calls `sys_common::util::abort` which calls `abort_internal`:
https://dxr.mozilla.org/rust/rev/0d8321b5e87c87aa4dbea729b4dd722740fac645/src/libstd/sys/windows/mod.rs#261

This is presumably why glandium filed bug 1458161.
Oh we can probably change that to `intrinsics::abort()` for sure; would that help it show up with the same signature as before?
Fx61 is fixed by backout at this point, but we should probably keep this on the radar for whenever 62 gets a newer Rust version again.
bug 1458161 comment 12 shows that we get slightly better stacks with glandium's patch. I'll rebase my test patch atop that and see what it looks like.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #21)
> bug 1458161 comment 12 shows that we get slightly better stacks with
> glandium's patch. I'll rebase my test patch atop that and see what it looks
> like.

That won't work. Bug 1458161 explicitly doesn't work on 1.27, and can't be made to work without changes to libstd that I'm working on.
Is this still an issue in 1.26 (released yesterday)? Should this still block upgrading from 1.24?
The speculative fix is in 1.27, so probably nothing changed in 1.26.
I don't see any crashes with this signature in 62, so far.
Alex, would it be possible to get https://github.com/rust-lang/rust/pull/50263 uplifted to stable rust, so we can try landing 1.26 in mozilla central?
Flags: needinfo?(acrichton)
(In reply to Chris Manchester (:chmanchester) from comment #27)
> Alex, would it be possible to get
> https://github.com/rust-lang/rust/pull/50263 uplifted to stable rust, so we
> can try landing 1.26 in mozilla central?

1.26 still has the encoding_rs performance regression. We'll have to wait for 1.28.
(well, more precisely, we'll have to wait whatever version has https://github.com/rust-lang/rust/pull/50880 merged in)
From what glandium mentioned it sounds like 1.26 may not work well? We're doing a point release for 1.26.1 but I think we're unlikely to get all this in there. That being said though I think we can definitely make sure to get everything in 1.27, for example:

* uwtable on allocator shims - https://github.com/rust-lang/rust/pull/50263 - already on beta
* fixing encoding_rs perf - https://github.com/rust-lang/rust/pull/50827 - being backported to beta
* OOM and memory request - https://github.com/rust-lang/rust/pull/50880 - can be backported if necessary

Does that sound ok? Or do y'all otherwise have a pressing need to upgrade to 1.26?
Flags: needinfo?(acrichton)
(In reply to Alex Crichton [:acrichto] from comment #30)
> * OOM and memory request - https://github.com/rust-lang/rust/pull/50880 -
> can be backported if necessary

That would require a patch with #[cfg(stage0)] hoop-jumping, though.
Sure yeah it may not be easy, but if Gecko needs it we can get it done
We can wait and see how things go in early beta 62. I'll keep this tracked in the meantime.
(In reply to Alex Crichton [:acrichto] from comment #30)
> From what glandium mentioned it sounds like 1.26 may not work well? We're
> doing a point release for 1.26.1 but I think we're unlikely to get all this
> in there. That being said though I think we can definitely make sure to get
> everything in 1.27, for example:
> 
> * uwtable on allocator shims - https://github.com/rust-lang/rust/pull/50263
> - already on beta
> * fixing encoding_rs perf - https://github.com/rust-lang/rust/pull/50827 -
> being backported to beta
> * OOM and memory request - https://github.com/rust-lang/rust/pull/50880 -
> can be backported if necessary
> 
> Does that sound ok? Or do y'all otherwise have a pressing need to upgrade to
> 1.26?

1.27 is going to ship during the next soft code freeze, for that reason as well we should probably miss that train. At this point 1.28 will end up being our next rustc upgrade, which will ship in time for the September 4th merge.
Good to know about the plans for Rust.

Since the signatures mentioned earlier in this bug aren't showing any crashes in the last few weeks, do we need to keep this particular bug open?
The various changes since this bug was filed made the OOM signatures change. It would be worth searching for the new signatures and associating them with this bug too. Unfortunately, nothing comes up when searching for "GeckoHandleOOM" on crash-stats, which is not really surprising, since the function just calls another and most probably doesn't create a stack frame. I tried looking at crashes in mozalloc_handle_oom that don't come from moz_xmalloc/moz_xrealloc, but I haven't gotten anything useful from the crash-stats UI (I can't find a way to search for "signature contains mozalloc_handle_oom" *and* "signature does not contain moz_xmalloc").
Fx62 is using Rust 1.24 still and Fx63 was able to update to 1.28 without this bug re-appearing. I think we're good here.
Assignee: nobody → mh+mozilla
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63