1266309 - crash in _ZN3std2io5stdio6_print17h0bae26b2b3e9ac51E

Reporter

Description

•

8 years ago

This bug was filed from the Socorro interface and is 
report bp-fe3a0851-557e-4ef8-be95-4c1782160417.
=============================================================

Rust crashes! I see seven with this signature in the past week, with the first one occurring in Nightly 20160415030231.

> 0 	kernelbase.dll 	RaiseException 	
> 1 	xul.dll 	_ZN3std2io5stdio6_print17h0bae26b2b3e9ac51E 	
> 2 	xul.dll 	_ZN3std10sys_common6unwind18begin_unwind_inner17hf40013815c474ad4E 	
> 3 	xul.dll 	_ZN3std10sys_common6unwind16begin_unwind_fmt17h42a3b0f6333a2a21E 	
> 4 	xul.dll 	rust_begin_unwind 	
> 5 	xul.dll 	mp4parse_read 	
> 6 	xul.dll 	mp4_demuxer::MP4MetadataRust::MP4MetadataRust(mp4_demuxer::Stream*) 	media/libstagefright/binding/MP4Metadata.cpp
> 7 	xul.dll 	mp4_demuxer::MP4Metadata::MP4Metadata(mp4_demuxer::Stream*) 	media/libstagefright/binding/MP4Metadata.cpp
> 8 	xul.dll 	mozilla::MP4Demuxer::Init() 	dom/media/fmp4/MP4Demuxer.cpp
> 9 	xul.dll 	mozilla::MediaFormatReader::AsyncReadMetadata() 	dom/media/MediaFormatReader.cpp
> 10 	xul.dll 	mozilla::detail::ProxyRunnable<mozilla::MozPromise<RefPtr<mozilla::MetadataHolder>, mozilla::ReadMetadataFailureReason, 1>, mozilla::MediaDecoderReader>::Run() 	obj-firefox/dist/include/mozilla/MozPromise.h

Here are frames 1--3 demangled:

> std::io::stdio::_print::h0bae26b2b3e9ac51
> std::sys_common::unwind::begin_unwind_inner::hf40013815c474ad4
> std::sys_common::unwind::begin_unwind_fmt::h42a3b0f6333a2a21

Nicholas Nethercote [inactive]

Reporter

Comment 1

•

8 years ago

rillian, can you please investigate?

Also, it's unfortunate that the Rust frames lack three important things: demangled names, line numbers, and links to source code. Do you know why? Are there bugs open to improve that?

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Updated

•

8 years ago

Flags: needinfo?(giles)

Nicholas Nethercote [inactive]

Reporter

Comment 2

•

8 years ago

> Also, it's unfortunate that the Rust frames lack three important things:
> demangled names, line numbers, and links to source code. Do you know why?

It doesn't even have the source file name.

Andrew McCreight [:mccr8]

Comment 3

•

8 years ago

Ted has also been working on some Oxidation stuff, but I'm not sure if that's more build or crash reporting stuff. But he might have some idea about improved crash reporting for Rust.

Nicholas Nethercote [inactive]

Reporter

Comment 4

•

8 years ago

I filed bug 1268328 about the inadequate crash reports.

Ralph Giles (:rillian)

Assignee

Comment 5

•

8 years ago

Thanks for the needinfo, dbaron.

The crash report in the description is on windows. There's some upstream discussion of the issue on mac in https://github.com/rust-lang/rust/issues/24346

It...kinda looks like we don't pass -g to rustc. I'll fix that.

Lars Bergstrom (:larsberg)

Comment 6

•

8 years ago

acrichto, have you seen panics on Windows in `print` before? From this call stack, it looks like we're panicing in `print`, Rust unrolls up to the boundary, it turns into an exception, and then that is taking us down.

Flags: needinfo?(acrichton)

Alex Crichton [:acrichto]

Comment 7

•

8 years ago

Oh I thought this was the same as bug 1253202? Which is to say that this is a problem where SSE instructions were in the standard library, which has since been fixed by switching to i586-pc-windows-msvc which has SSE disabled.

If this *isn't* that problem, than I haven't personally seen panics in `print` before, but because the call stack goes directly from `print` to `RaiseException` I'd guess that a panic isn't happening but rather some other fault.

Flags: needinfo?(acrichton)

Lars Bergstrom (:larsberg)

Comment 8

•

8 years ago

Hrm, I thought it might be a rust panic because the crash reason is:
Crash Reason 	0x00525354 / 0x00000000

Which is coincidentally the exception code for rust panics:
https://github.com/rust-lang/rust/blob/c66d2380a810c9a2b3dbb4f93a830b101ee49cc2/src/libstd/sys/common/unwind/seh.rs#L64

Alex Crichton [:acrichto]

Comment 9

•

8 years ago

Aha yep that's a Rust panic! This may just be a "we failed to write to stdout" panic [1], and we may just need to tweak to ignore I/O errors during a panic. For example I suspect this line [2] is firing which is causing a panic-in-a-panic which provides subpar error reporting.


[1]: https://github.com/rust-lang/rust/blob/master/src/libstd/io/stdio.rs#L590-L619
[2]: https://github.com/rust-lang/rust/blob/3f65afa6946ee4a6b1d619aee69e1f638edfb662/src/libstd/io/stdio.rs#L617

Nathan Froyd [:froydnj]

Comment 10

•

8 years ago

AIUI, Windows GUI processes don't really have a notion of stdout unless you compile them in debug mode, so that might provide some insight into what's going on here.

Lars Bergstrom (:larsberg)

Comment 11

•

8 years ago

Alex, do you have a bug and can we get started on a fix for this? We're probably going to have to get this fix, get a new build of Rust, and push it out to the servers for the 48 train, as it sounds like we'll hit this crash in release when 48 gets to users otherwise.

Flags: needinfo?(acrichton)

Alex Crichton [:acrichto]

Comment 12

•

8 years ago

Hm, I'll try to analyze a bit more to see if we can figure out what's going on.

To be clear, this code is *already* panicking, so the process is already "crashing". According to that backtrace the function `mp4parse_read` initiate a panic. The backtrace then claims there's a frame of `_ZN3std2io5stdio6_print17h0bae26b2b3e9ac51E` (demangled as `std::io::stdio::_print::h...`) before `RaiseException` is finally called. The `RaiseException` function is used to implement panics on MSVC in Rust, and it is only called in one location [1] (not `_print`).

When debugging code on MSVC before I've noticed that backtraces tend to be particularly bad (unfortunately). We generate lots of `internal` symbols in the compiler (corresponding to `static` functions in C), which by default aren't resolved by the standard backtrace utilities on MSVC. I think the situation is helped when generating debug information, however.

What ends up happening, however, is that in the object file you have a bunch of functions laid out sequentially. Whenever a function is exported from a DLL, there'll be a bunch of unrelated functions exported after that (just sequentially in memory), and the backtrace will think that all those functions are the exported one. In other words, backtraces are basically incorrect on MSVC from what I've seen due to our aggressive usage of internal symbols.

So my best guess as to what's happening here is:

* The `mp4parse_read` function is panicking
* A Windows exception is getting raised via `RaiseException`
* A crash report is generated at this point
* The backtrace is lying saying that it's in the `_print` function, it's rather just that the function is an internal symbol and if you go lower in memory the first exported function is `_print`

Does that make sense? I don't believe there's something to change in the standard library, but I could be wrong! I do think, however, that debuginfo will probably help quite a bit here :)

[1]: https://github.com/rust-lang/rust/blob/3f65afa6946ee4a6b1d619aee69e1f638edfb662/src/libstd/sys/common/unwind/seh.rs#L88
The `_print` function is used to implement `print!` and `println!` macros, neither of which are called by the implementations of panics [2].

[2]: https://github.com/rust-lang/rust/blob/3f65afa6946ee4a6b1d619aee69e1f638edfb662/src/libstd/panicking.rs#L152

Flags: needinfo?(acrichton)

Nicholas Nethercote [inactive]

Reporter

Comment 13

•

8 years ago

We've seen 134 of these in the past 7 days across FF 48 and 49.

It's possible that bug 1271529 comment 1 might be relevant here -- it explains malformed stack traces in some other Rust panics.

If we just had line numbers in the stack trace then we'd probably be able to tell where/how mp4parse is panicking, correct?

Alex Crichton [:acrichto]

Comment 14

•

8 years ago

With a crash like https://crash-stats.mozilla.com/report/index/562dbeeb-728e-4478-9a77-858da2160516 which appears to have line numbers in stack traces, it looks like it may not be the silver bullet here unfortunately. I'm not sure how well PDB/LLVM deal with inlining information, so the line number may be getting lost through inlining.

In the case of inlining losing filename/line number, the next best option would be a better introspection of the panic *message*, which could be done through either:

* catch_unwind - http://doc.rust-lang.org/nightly/std/panic/fn.catch_unwind.html
* set_hook - http://doc.rust-lang.org/nightly/std/panic/fn.set_hook.html

The latter may look like it gives file/line information, but it'll just tell you the same result.rs:687 rather than where the panic came from in your code itself.

Nathan Froyd [:froydnj]

Comment 15

•

8 years ago

OK, so groveling around in the minidump for https://crash-stats.mozilla.com/report/index/fe3a0851-557e-4ef8-be95-4c1782160417, I think what happens is that we're here:

http://dxr.mozilla.org/mozilla-central/source/media/libstagefright/binding/mp4parse/capi.rs#121

and we panic because thread::spawn assumes that thread creation succeeds.  Rust threads take 2MB stacks by default (https://github.com/rust-lang/rust/blob/master/src/libstd/sys/common/util.rs#L18) and we have 100MB of virtual memory free.  We ought to be able to make this allocation succeed, but experience suggests that Windows can have trouble fulfilling large requests like that.  (I thought we took about:memory snapshots, which would be able to tell us the max contiguous block of virtual memory, but maybe I'm thinking of something else or about:memory reporting didn't kick in here, because Rust crashes are different?)

Spawning a thread just to handle mp4 parsing seems heavyweight, but if we're going to do that, can we can least make thread creation fallible by using:

  let task = thread::Builder::new().spawn(...);

and unwrapping |task| safely to get at something which we can then wait on?

Ralph Giles (:rillian)

Assignee

Comment 16

•

8 years ago

Wow, thanks for the analysis, Nathan. I hadn't considered it could be OOM because of the thread stack.

The thread doesn't offer any concurrency benefit here. It was just about protecting users from panics in the my-first-rust-code of the mp4 parser. That's mostly cleaned up, although there are still a few assert!()s, and we're planning to use panic=abort anyway, so the best thing is to remove this thread like we did for mp4parse_get_track_info().

Assignee: nobody → giles

Flags: needinfo?(giles)

Alex Crichton [:acrichto]

Comment 17

•

8 years ago

Nice! I think that'd definitely do it in terms of causing a panic. I forgot to mention earlier as well, but I'd highly recommend capturing the panic message as well as it should in theory contain the OS error message which would point to OOM. That'd at least provide another vector for tracking down panics in Rust!

Ralph Giles (:rillian)

Assignee

Comment 18

•

8 years ago

Attached file MozReview Request: Bug 1266309 - Use fallible rust thread::Builder. r?kinetik — Details

We think thread::spawn() may be panicking in low memory
conditions. Test this by using the fallible thread::Builder
API and converting spawn Results into an error return.

Review commit: https://reviewboard.mozilla.org/r/55232/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/55232/

Attachment #8756545 - Flags: review?(kinetik)

Matthew Gregan [:kinetik]

Updated

•

8 years ago

Attachment #8756545 - Flags: review?(kinetik) → review+

Matthew Gregan [:kinetik]

Comment 19

•

8 years ago

Comment on attachment 8756545 [details]
MozReview Request: Bug 1266309 - Use fallible rust thread::Builder. r?kinetik

https://reviewboard.mozilla.org/r/55232/#review51922

Pulsebot

Comment 20

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/3b6fb13e56de

Nicholas Nethercote [inactive]

Reporter

Comment 21

•

8 years ago

Great work, Nathan.

(In reply to Alex Crichton [:acrichto] from comment #17)
> Nice! I think that'd definitely do it in terms of causing a panic. I forgot
> to mention earlier as well, but I'd highly recommend capturing the panic
> message as well as it should in theory contain the OS error message which
> would point to OOM. That'd at least provide another vector for tracking down
> panics in Rust!

I filed bug 1275780 for this. froydnj, want to give it a go?

Ryan VanderMeulen [:RyanVM]

Comment 22

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/3b6fb13e56de

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox49: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla49

Ralph Giles (:rillian)

Assignee

Comment 23

•

8 years ago

[Tracking Requested - why for this release]: This fix applies equally well to 48, which is also seeing crashes. I'd like to request uplift once we've confirmed this fixes the issue for 49 nightly.

The crash reports we have are all on win32, so I think the problem is specific to that target. The rust code isn't enabled on windows for 47 or earlier, so they are unaffected.

status-firefox47: --- → unaffected

tracking-firefox48: --- → ?

Ralph Giles (:rillian)

Assignee

Comment 24

•

8 years ago

Looks like this didn't get everything. I see new signatures since this landed last week. e.g. https://crash-stats.mozilla.com/report/index/c29fc408-a68d-4fe5-a9dd-6df502160530

Nicholas Nethercote [inactive]

Reporter

Comment 25

•

8 years ago

(In reply to Nathan Froyd [:froydnj] from comment #15)
> we panic because thread::spawn assumes that thread creation succeeds. 
> Rust threads take 2MB stacks by default

Are we going to potentially have this problem every place we end up calling thread::spawn? Would it make sense to shrink the default thread stack size?

Flags: needinfo?(giles)

Ralph Giles (:rillian)

Assignee

Comment 26

•

8 years ago

It's probably safer to limit the stack size when we create threads, like we do in some gecko C++ code. Rust provides the stack_size() method on thread::Builder for this.

Flags: needinfo?(giles)

Ralph Giles (:rillian)

Assignee

Comment 27

•

8 years ago

Comment on attachment 8756545 [details]
MozReview Request: Bug 1266309 - Use fallible rust thread::Builder. r?kinetik

In comment #24 I forgot to check the version number. Crashes on 49 with this signature stopped May 25 after the patch landed. Requesting uplift to 48.

Approval Request Comment
[Feature/regressing bug #]: 1161350
[User impact if declined]: Crashes playing video in low memory conditions.
[Describe test coverage new/current, TreeHerder]: Landed on m-c last week.
[Risks and why]: Risk is low. Patch adds a check for thread creation failure and fall back to the C++ mp4 demuxer instead of crashing.
[String/UUID change made/needed]: None.

Attachment #8756545 - Flags: approval-mozilla-aurora?

Nicholas Nethercote [inactive]

Reporter

Comment 28

•

8 years ago

(In reply to Ralph Giles (:rillian) needinfo me from comment #26)
> It's probably safer to limit the stack size when we create threads, like we
> do in some gecko C++ code. Rust provides the stack_size() method on
> thread::Builder for this.

Great. If we do that then this bug's patch could be reverted, correct? Should I file a bug for limiting the stack size?

Sylvestre Ledru [:Sylvestre]

Comment 29

•

8 years ago

Comment on attachment 8756545 [details]
MozReview Request: Bug 1266309 - Use fallible rust thread::Builder. r?kinetik

Fix a crash, taking it.

Attachment #8756545 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Carsten Book [:Tomcat]

Comment 30

•

8 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-aurora/rev/bb81bf646c3d

status-firefox48: affected → fixed

[:philipp]

Comment 31

•

8 years ago

hi, crashes with this signature are still occurring in 48.0b1 unfortunately.

[:philipp]

Comment 32

•

8 years ago

it is currently the #20 crash in 48.0b1 with 0.65% of all crashes, so i'm reopening this bug to make sure this isn't going under the radar.

Status: RESOLVED → REOPENED

Flags: needinfo?(giles)

Keywords: regression

Resolution: FIXED → ---

[:philipp]

Updated

•

8 years ago

status-firefox48: fixed → affected

status-firefox49: fixed → ?

Lars Bergstrom (:larsberg)

Comment 33

•

8 years ago

From a hand inspection of the final 40 or so crashes (is there a better way?), the 48a1 crashes have crashes in both `mp4parse_read` and `mp4parse_get_track_info`, but the 48b1 crashes are only in `mp4parse_get_track_info`. When should the attached fix (which should have fixed `mp4parse_read`) have gone out?

And, if this is just a crash in `mp4parse_get_track_info` for this signature now, is there any possibility that one of the `unwrap()` calls or something else in there is at fault? At a first glance, they all appear to be guarded by `is_some()` checks further up. Or is there another possible source of panics here?
https://dxr.mozilla.org/mozilla-central/source/media/libstagefright/binding/mp4parse/capi.rs#157-199

The only other thing that I can imagine is that we're hitting one of the `assert!` calls here:
https://dxr.mozilla.org/mozilla-central/source/media/libstagefright/binding/mp4parse/lib.rs#1258-1267

(In Rust, `assert!` is also present in release https://doc.rust-lang.org/std/macro.assert!.html, whereas `debug_assert!` is not https://doc.rust-lang.org/std/macro.debug_assert!.html)

Ralph Giles (:rillian)

Assignee

Comment 34

•

8 years ago

The fix on this bug for `mp4parse_read` was uplifted to 48 while that branch was still on aurora, so it should be on all beta builds.

If this is still an oom crash from the thread spawn, and limited to `mp4parse_get_track_info` on 48 then this should be fixed by bug 1275189. I failed to request uplift for that change while 48 was on aurora. I've now done so for beta.

Per the analysis there, the unwrap()s in `mp4parse_get_track_info` are all guarded. We've further audited the lib.rs asserts! today and don't think those can be triggered either.

Flags: needinfo?(giles)

[:philipp]

Comment 35

•

8 years ago

the signature is gone again in 48.0b3 crash data after the fix in bug 1275189 landed there. thank you!

Status: REOPENED → RESOLVED

Closed: 8 years ago → 8 years ago

status-firefox48: affected → fixed

status-firefox49: ? → fixed

tracking-firefox48: ? → ---

Resolution: --- → FIXED