Closed Bug 1456150 Opened 3 years ago Closed 3 years ago

Crash in static void alloc_system::platform::{{impl}}::oom

Categories

(Core :: General, defect)

Platform: Unspecified / Windows 10
Priority: Not set
Severity: critical

Tracking


RESOLVED FIXED
mozilla63
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox60 --- unaffected
firefox61 + fixed
firefox62 + fixed
firefox63 + fixed

People

(Reporter: marcia, Assigned: glandium)

References

Details

(Keywords: crash)

Crash Data

This bug was filed from the Socorro interface and is
report bp-ee778cd8-02e2-4c23-97e0-e5d270180331.
=============================================================

Seen while looking at nightly crash stats - this one goes back to the build from 20180331045244: https://bit.ly/2HEHhIa. All of the crashes have crash reason "EXCEPTION_ILLEGAL_INSTRUCTION".

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b7fa9d95150ec24e3c8d11106153f76be4507356&tochange=3f9a70b125f6fb9be2b209159657fd7ae5515c01

Top 4 frames of crashing thread:

0 xul.dll static void alloc_system::platform::{{impl}}::oom src/liballoc_system/lib.rs:449
1 xul.dll static void alloc_system::{{impl}}::oom src/liballoc_system/lib.rs:81
2 xul.dll static void std::heap::__default_lib_allocator::__rdl_oom src/libstd/heap.rs:49
3 xul.dll xul.dll@0x5b0df8 

=============================================================
It looks like all of the Rust OOM signatures are getting bucketed together.

The first problem I see is that Rust stack walking isn't getting past the first three frames. Maybe this is a regression from "Bug 1447116 - Update rust builders to 1.25"?

The second problem is that the crash stats signature should go past the first frame. I'm not sure what Rust OOM signatures looked like before so I can't tell if that's a regression or not. Ideally, we'd have one frame that would indicate that it is an OOM, and then a frame from the place that is actually causing the OOM.

The third issue then is that we're hitting an OOM in some places. I don't know if these are, separately, enough of an issue to warrant attention. Looking through the proto signatures for reports that actually have useful information after the junk xul.dll frame, I see some crashes that are OOMing in Servo media query list code like this:
  bp-d3ddd86f-76ea-4f5f-a214-f15880180421

For some of the signatures where there are no extra frames, a lot of them seem to be on the MediaPlayback thread, so I assume that the Rust media parser thing is involved somehow.
Nathan, do you know who would look into Rust stack traces not working right? Possibly as a regression from a Rust upgrade. Thanks.
Blocks: 1447116
Flags: needinfo?(nfroyd)
Unfortunately the Rust code is not nice enough to tell us what the OOM amount might be...but the crash dump says our smallest contiguous VM block is < 2MB, so that's an issue no matter what size we're trying to allocate.

Ted is the person who has handled Rust stack related things before, but perhaps he has somebody else to suggest?

Rust 1.25 has caused some other, performance-related issues; perhaps we should revert to 1.24 for this cycle while we sort out the Rust 1.25 problems?  Chris, WDYT?
Flags: needinfo?(ted)
Flags: needinfo?(nfroyd)
Flags: needinfo?(cmanchester)
Can we confirm this is a rust 1.25 regression? I guess it seems likely given the regression range.

I'm trying to determine in bug 1451703 if staying on 1.24 as a result of that regression would be useful. If so we can back out and see if that helps.
Flags: needinfo?(cmanchester)
Sure, I can dig around in crash stats and try to figure out what OOM crashes looked like before.
* First question: what do Rust OOMs look like? For this, I looked at crashes where the protosignature contains _oom, but the signature does not contain mozalloc_abort (to try to filter out C++ crashes), from the past 3 months, across all branches.

These are the top three:
std::heap::__default_lib_allocator::__rdl_oom (913 crashes)
static void alloc_system::platform::{{impl}}::oom (104 crashes)
__rust_oom (49 crashes)

There are also a smattering of less-common signatures, such as:
  OOM | large | std::heap::__default_lib_allocator::__rdl_oom (from a recent Firefox)
  OOM | unknown | alloc::oom::default_oom_handler | alloc::oom::oom | mp4parse::skip<T>  (from Firefox 54)
So it looks like sometimes Rust crashes are correctly identified as OOM. I'll try to find out where the code is for that.

* Second question: what do the most common Rust OOMs look like on Nightly? For this, I looked at crashes where the protosignature contains any of the top three OOM stack frames I identified above, but only for Nightly 61. I looked at the period from March 10, slightly before the first Nightly 61, to April 18. That's about three weeks before and two and a half weeks after March 31, when the compiler change landed. There are exactly 7 such crashes before the 3-31 build, and 88 after.

So, it looks like OOM crashes went up massively on the 3-31 build. I don't know if that is due to the Rust upgrade, or something else. (Maybe a WebRender or Stylo change landed in the same Nightly.)

* What about the xul.dll frames? None of the 7 crashes before the 3-31 build (for example: bp-2369c432-2382-4384-bd06-cb37b0180320 and bp-bda1d6da-38ca-4ffa-8759-b884c0180322) have the alloc_system::platform::{{impl}}::oom signature, and their stacks look good.

* alloc_system::platform::{{impl}}::oom in particular: I don't see any crashes on any branch prior to the 3-31 Nightly build that have the alloc_system::platform::{{impl}}::oom signature. All of the alloc_system::platform::{{impl}}::oom signatures seem to contain xul.dll. About half of these crashes contain more frames, but half do not.

TLDR: Rust OOM crashes went up massively on the Nightly 3-31 build (though these are still not very common crashes overall), and this alloc_system::platform::{{impl}}::oom signature also appeared there, and has an unsymbolized frame (and, half the time, no more frames), whereas the handful of older Nightly crashes did not have these unsymbolized frames.
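The searches above can also be scripted against Socorro's SuperSearch API. This is a sketch that only builds the query URL; the `~` (contains) and `!~` (does not contain) operator prefixes come from the SuperSearch documentation, and the field names are the ones used in this comment, so treat the exact parameters as assumptions to verify against the API docs:

```rust
// Build a Socorro SuperSearch query URL like the searches described above:
// proto_signature contains `contains`, and signature must not contain any
// of `excludes`.
fn encode(s: &str) -> String {
    // Minimal percent-encoding, sufficient for the characters used here.
    s.bytes()
        .map(|b| match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                (b as char).to_string()
            }
            _ => format!("%{:02X}", b),
        })
        .collect()
}

fn oom_query(contains: &str, excludes: &[&str]) -> String {
    let mut params = vec![
        ("proto_signature".to_string(), format!("~{}", contains)),
        ("_facets".to_string(), "signature".to_string()),
        // Facets only; we don't need the individual crash reports.
        ("_results_number".to_string(), "0".to_string()),
    ];
    for e in excludes {
        // Repeated negative filters on the same field are ANDed together.
        params.push(("signature".to_string(), format!("!~{}", e)));
    }
    let query: Vec<String> = params
        .iter()
        .map(|(k, v)| format!("{}={}", encode(k), encode(v)))
        .collect();
    format!(
        "https://crash-stats.mozilla.org/api/SuperSearch/?{}",
        query.join("&")
    )
}

fn main() {
    println!("{}", oom_query("_oom", &["mozalloc_abort"]));
}
```

Fetching the URL and faceting on signature then gives counts like the ones quoted above.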
(the lack of stack traces)
So re: the OOM signatures being bucketed together, we already have a bunch of Rust std library frames in the prefix list:
https://github.com/mozilla-services/socorro/blob/master/socorro/siglists/prefix_signature_re.txt

Looking at the blame, bug 1373272 and bug 1389474 added those, but it looks like either the function names changed in the stdlib or the formatting of the names changed.
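If the function names did change, fixing the bucketing would mean adding patterns for the new names to that file. The entries appear to be one regular expression per line; these hypothetical additions follow that shape, but the exact escaping conventions are an assumption and should match the file's existing style:

```text
alloc_system::platform::\{\{impl\}\}::oom
alloc_system::\{\{impl\}\}::oom
std::heap::__default_lib_allocator::__rdl_oom
```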
I loaded the minidump from the crash in comment 0 in WinDbg and it was able to unwind farther:
 # Child-SP          RetAddr           Call Site
00 00000054`d23fb4c0 00007ffa`db865442 xul!alloc_system::platform::{{impl}}::oom(void)+0x5c [C:\projects\rust\src\liballoc_system\lib.rs @ 449]
01 00000054`d23fb540 00007ffa`db864b42 xul!alloc_system::{{impl}}::oom(void)+0x22 [C:\projects\rust\src\liballoc_system\lib.rs @ 81]
02 00000054`d23fb590 00007ffa`db870df9 xul!std::heap::__default_lib_allocator::__rdl_oom(void)+0x22 [C:\projects\rust\src\libstd\heap.rs @ 49]
03 00000054`d23fb5e0 00000000`000005af xul!core::fmt::{{impl}}::fmt+0x39
04 00000054`d23fb5e8 00000054`d23fb700 0x5af
05 00000054`d23fb5f0 00000000`00000001 0x00000054`d23fb700
06 00000054`d23fb5f8 00000054`d23fb700 0x1
07 00000054`d23fb600 0000021d`45068240 0x00000054`d23fb700
08 00000054`d23fb608 00007ffa`db358a9f 0x0000021d`45068240
09 00000054`d23fb610 00007ffa`db5ed60d xul!alloc::heap::{{impl}}::oom(void)+0xf [C:\projects\rust\src\liballoc\heap.rs @ 98]
0a (Inline Function) --------`-------- xul!smallvec::SmallVec<[style::gecko::wrapper::GeckoElement; 128]>::push+0x82 [z:\build\build\src\third_party\rust\smallvec\lib.rs @ 401]
0b 00000054`d23fb640 00007ffa`db5ec8eb xul!style::dom_apis::{{impl}}::append_element<style::gecko::wrapper::GeckoElement>(struct smallvec::SmallVec<[style::gecko::wrapper::GeckoElement; 128]> * output = 0x00000054`d23fb700, struct style::gecko::wrapper::GeckoElement element = struct style::gecko::wrapper::GeckoElement)+0x9d [z:\build\build\src\servo\components\style\dom_apis.rs @ 97]
0c (Inline Function) --------`-------- xul!style::dom_apis::collect_all_elements+0xb [z:\build\build\src\servo\components\style\dom_apis.rs @ 219]
0d (Inline Function) --------`-------- xul!style::dom_apis::query_selector_single_query+0x199 [z:\build\build\src\servo\components\style\dom_apis.rs @ 332]
0e (Inline Function) --------`-------- xul!style::dom_apis::query_selector_fast+0x22c [z:\build\build\src\servo\components\style\dom_apis.rs @ 392]
0f (Inline Function) --------`-------- xul!style::dom_apis::query_selector+0x335 [z:\build\build\src\servo\components\style\dom_apis.rs @ 551]
10 00000054`d23fb6d0 00007ffa`dbce2cf1 xul!geckoservo::glue::Servo_SelectorList_QueryAll(struct style::gecko_bindings::structs::root::nsINode * node = <Value unavailable error>, struct style::gecko_bindings::structs::root::RawServoSelectorList * selectors = 0x0000021d`482593e0, struct style::gecko_bindings::structs::root::nsSimpleContentList * content_list = 0x0000021d`4ec37af0, bool may_use_invalidation = <Value unavailable error>)+0x36b [z:\build\build\src\servo\ports\geckolib\glue.rs @ 1962]
11 00000054`d23fc3b0 00007ffa`dbce29d7 xul!nsINode::QuerySelectorAll(class nsTSubstring<char16_t> * aSelector = 0x00000054`d23fc470, class mozilla::ErrorResult * aResult = <Value unavailable error>)+0x71 [z:\build\build\src\dom\base\nsinode.cpp @ 2614]
12 00000054`d23fc3f0 00007ffa`db9bff63 xul!mozilla::dom::ElementBinding::querySelectorAll(struct JSContext * cx = 0x0000021d`3c323000, class JS::Handle<JSObject *> obj = class JS::Handle<JSObject *>, class mozilla::dom::Element * self = 0x0000021d`4705c900, class JSJitMethodCallArgs * args = 0x00000054`d23fc570)+0xbf [z:\build\build\src\obj-firefox\dom\bindings\elementbinding.cpp @ 3849]
13 00000054`d23fc550 000003c3`b490ac2b xul!mozilla::dom::GenericBindingMethod(struct JSContext * cx = 0x0000021d`3c323000, unsigned int argc = 1, union JS::Value * vp = <Value unavailable error>)+0x11b [z:\build\build\src\dom\bindings\bindingutils.cpp @ 3038]
14 00000054`d23fc5e0 00000000`00000000 0x000003c3`b490ac2b


There are quite a few junk frames in the middle there, so perhaps something in the Rust code is not generating good unwind info?
Flags: needinfo?(ted)
I don't have time to dig into this further. Perhaps mw can look at whether rustc is generating good unwind info for these frames? We previously dealt with this in bug 1302078, and acrichto landed a fix in rustc to make it always emit unwind info for win64 even when building with `-Cpanic=abort` in https://github.com/rust-lang/rust/pull/40549.
I noticed in https://github.com/rust-lang/rust/pull/50263 that we're not actually emitting the `uwtable` thing for LLVM for *all* functions (a few allocator-related shims were accidentally exempt). I'm not sure if that'll actually fix the issue here.

The OOM handling on nightly is also slightly different than that on stable right now, so if this is purely related to the `uwtable` attribute accidentally being omitted then it may already be fixed on nightly.

Is there a good way to test out a patch here? Or is it mainly a "wait and see" scenario?
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #11)
> I don't have time to dig into this further. Perhaps mw can look at whether
> rustc is generating good unwind info for these frames? We previously dealt
> with this in bug 1302078, and acrichto landed a fix in rustc to make it
> always emit unwind info for win64 even when building with `-Cpanic=abort` in
> https://github.com/rust-lang/rust/pull/40549.

Does breakpad need unwind info, though? It doesn't for DWARF, maybe it's different for pdb?
(In reply to Mike Hommey [:glandium] from comment #13)
> Does breakpad need unwind info, though? It doesn't for DWARF, maybe it's
> different for pdb?

These are Win64 crashes, and I don't think we have any way to reliably unwind those stacks without it. I'm not 100% sure but I would guess that the `uwtable` attribute controls whether we wind up with entries in the unwind tables in the PE.

(In reply to Alex Crichton [:acrichto] from comment #12)
> Is there a good way to test out a patch here? Or is it mainly a "wait and
> see" scenario?

If there's a way to generate a Rust OOM crash for testing purposes I think we could figure something out. It's easy enough to tweak the Rust toolchain version in a try push, you can just change the arguments to repack_rust.py and we'll use that Rust version in the Firefox build:
https://dxr.mozilla.org/mozilla-central/rev/63a0e2f626febb98d87d2543955ab99a653654ff/taskcluster/ci/toolchain/windows.yml#137

It does require an explicit date like `nightly-2018-04-26`.
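For illustration, the kind of change described above might look like this hypothetical excerpt; the task name and surrounding key structure here are assumptions, and the authoritative layout is the windows.yml linked above:

```yaml
# Hypothetical excerpt; see taskcluster/ci/toolchain/windows.yml for the
# real structure. repack_rust.py's channel argument needs an explicit
# nightly date rather than just "nightly".
win64-rust:
    run:
        arguments: ['--channel', 'nightly-2018-04-26']
```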

We could tweak the internals of nsIDebug2::RustPanic to trigger an OOM in the same try push:
https://dxr.mozilla.org/mozilla-central/rev/63a0e2f626febb98d87d2543955ab99a653654ff/toolkit/library/rust/shared/lib.rs#37

Then download and run the resulting build, call `rustPanic()` and run minidump_stackwalk on the minidump.
Ok cool! I'll ping this thread when the support lands in rust-lang/rust and reaches a nightly. I believe you can "trigger an OOM" by basically just calling std::alloc::oom (https://doc.rust-lang.org/nightly/std/alloc/fn.oom.html) on recent nightlies, as that's what collections and such will be doing anyway.
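The allocation-failure path can also be exercised without actually aborting, which is handy for checking the plumbing before wiring an OOM into RustPanic. A minimal sketch, assuming the modern stable `std::alloc` API (on the 2018 nightlies discussed here the equivalent final step would be calling the then-unstable `std::alloc::oom` once the allocation fails):

```rust
use std::alloc::{alloc, dealloc, Layout};

/// Attempt an allocation larger than the address space; returns true if it
/// failed (i.e. `alloc` returned null). In Firefox it is the global OOM
/// handler invoked at this point (`std::alloc::oom` on the nightlies
/// discussed above, `handle_alloc_error` on modern Rust) that aborts the
/// process and produces the crash signatures in this bug.
fn huge_alloc_fails() -> bool {
    let layout = Layout::from_size_align(isize::MAX as usize - 8, 1).unwrap();
    let p = unsafe { alloc(layout) };
    if p.is_null() {
        true
    } else {
        // Shouldn't happen, but don't leak if it somehow succeeds.
        unsafe { dealloc(p, layout) };
        false
    }
}

fn main() {
    assert!(huge_alloc_fails());
    println!("allocation failed as expected");
}
```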
Ok that was a little speedier than I thought it would be! The `nightly-2018-04-28` toolchain for MSVC should have the fix in https://github.com/rust-lang/rust/pull/50263, want to try compiling with that and see if it improves the situation here?
I got a working build out of that, but `std::alloc::oom` now crashes the process in a way that Breakpad cannot handle, so I can't use it to verify this. :-/

Specifically it uses `rtabort!` which calls `sys_common::util::abort` which calls `abort_internal`:
https://dxr.mozilla.org/rust/rev/0d8321b5e87c87aa4dbea729b4dd722740fac645/src/libstd/sys/windows/mod.rs#261

This is presumably why glandium filed bug 1458161.
Oh we can probably change that to `intrinsics::abort()` for sure; would that help it show up with the same signature as before?
Fx61 is fixed by backout at this point, but we should probably keep this on the radar for whenever 62 gets a newer Rust version again.
bug 1458161 comment 12 shows that we get slightly better stacks with glandium's patch. I'll rebase my test patch atop that and see what it looks like.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #21)
> bug 1458161 comment 12 shows that we get slightly better stacks with
> glandium's patch. I'll rebase my test patch atop that and see what it looks
> like.

That won't work. Bug 1458161 explicitly doesn't work on 1.27, and can't be made to work without changes to libstd that I'm working on.
Is this still an issue in 1.26 (released yesterday)? Should this still block upgrading from 1.24?
The speculative fix is in 1.27, so probably nothing changed in 1.26.
I don't see any crashes with this signature in 62, so far.
Alex, would it be possible to get https://github.com/rust-lang/rust/pull/50263 uplifted to stable rust, so we can try landing 1.26 in mozilla central?
Flags: needinfo?(acrichton)
(In reply to Chris Manchester (:chmanchester) from comment #27)
> Alex, would it be possible to get
> https://github.com/rust-lang/rust/pull/50263 uplifted to stable rust, so we
> can try landing 1.26 in mozilla central?

1.26 still has the encoding_rs performance regression. We'll have to wait for 1.28.
(well, more precisely, we'll have to wait whatever version has https://github.com/rust-lang/rust/pull/50880 merged in)
From what glandium mentioned it sounds like 1.26 may not work well? We're doing a point release for 1.26.1 but I think we're unlikely to get all this in there. That being said though I think we can definitely make sure to get everything in 1.27, for example:

* uwtable on allocator shims - https://github.com/rust-lang/rust/pull/50263 - already on beta
* fixing encoding_rs perf - https://github.com/rust-lang/rust/pull/50827 - being backported to beta
* OOM and memory request - https://github.com/rust-lang/rust/pull/50880 - can be backported if necessary

Does that sound ok? Or do y'all otherwise have a pressing need to upgrade to 1.26?
Flags: needinfo?(acrichton)
(In reply to Alex Crichton [:acrichto] from comment #30)
> * OOM and memory request - https://github.com/rust-lang/rust/pull/50880 -
> can be backported if necessary

That would require a patch with #[cfg(stage0)] hoop-jumping, though.
Sure yeah it may not be easy, but if Gecko needs it we can get it done
We can wait and see how things go in early beta 62. I'll keep this tracked in the meantime.
(In reply to Alex Crichton [:acrichto] from comment #30)
> From what glandium mentioned it sounds like 1.26 may not work well? We're
> doing a point release for 1.26.1 but I think we're unlikely to get all this
> in there. That being said though I think we can definitely make sure to get
> everything in 1.27, for example:
> 
> * uwtable on allocator shims - https://github.com/rust-lang/rust/pull/50263
> - already on beta
> * fixing encoding_rs perf - https://github.com/rust-lang/rust/pull/50827 -
> being backported to beta
> * OOM and memory request - https://github.com/rust-lang/rust/pull/50880 -
> can be backported if necessary
> 
> Does that sound ok? Or do y'all otherwise have a pressing need to upgrade to
> 1.26?

1.27 is going to ship during the next soft code freeze, for that reason as well we should probably miss that train. At this point 1.28 will end up being our next rustc upgrade, which will ship in time for the September 4th merge.
Good to know about the plans for Rust.

Since the signatures mentioned earlier in this bug aren't showing any crashes in the last few weeks, do we need to keep this particular bug open?
The various changes since this bug was filed made the OOM signatures change. It would be worth searching for the new signatures and associating them with this bug too. Unfortunately, nothing comes up when searching for "GeckoHandleOOM" on crash-stats, which is not really surprising, since the function just calls another and most probably doesn't create a stack frame. I tried looking at crashes in mozalloc_handle_oom that don't come from moz_xmalloc/moz_xrealloc, but I haven't gotten anything useful from the crash-stats UI (I can't find a way to search for "signature contains mozalloc_handle_oom" *and* "signature does not contain moz_xmalloc").
Fx62 is using Rust 1.24 still and Fx63 was able to update to 1.28 without this bug re-appearing. I think we're good here.
Assignee: nobody → mh+mozilla
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63