Closed Bug 1266309 Opened 8 years ago Closed 8 years ago

crash in _ZN3std2io5stdio6_print17h0bae26b2b3e9ac51E

Categories

(Core :: Audio/Video: Playback, defect)

x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox47 --- unaffected
firefox48 --- fixed
firefox49 --- fixed

People

(Reporter: n.nethercote, Assigned: rillian)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-fe3a0851-557e-4ef8-be95-4c1782160417.
=============================================================

Rust crashes! I see seven with this signature in the past week, with the first one occurring in Nightly 20160415030231.

> 0 	kernelbase.dll 	RaiseException 	
> 1 	xul.dll 	_ZN3std2io5stdio6_print17h0bae26b2b3e9ac51E 	
> 2 	xul.dll 	_ZN3std10sys_common6unwind18begin_unwind_inner17hf40013815c474ad4E 	
> 3 	xul.dll 	_ZN3std10sys_common6unwind16begin_unwind_fmt17h42a3b0f6333a2a21E 	
> 4 	xul.dll 	rust_begin_unwind 	
> 5 	xul.dll 	mp4parse_read 	
> 6 	xul.dll 	mp4_demuxer::MP4MetadataRust::MP4MetadataRust(mp4_demuxer::Stream*) 	media/libstagefright/binding/MP4Metadata.cpp
> 7 	xul.dll 	mp4_demuxer::MP4Metadata::MP4Metadata(mp4_demuxer::Stream*) 	media/libstagefright/binding/MP4Metadata.cpp
> 8 	xul.dll 	mozilla::MP4Demuxer::Init() 	dom/media/fmp4/MP4Demuxer.cpp
> 9 	xul.dll 	mozilla::MediaFormatReader::AsyncReadMetadata() 	dom/media/MediaFormatReader.cpp
> 10 	xul.dll 	mozilla::detail::ProxyRunnable<mozilla::MozPromise<RefPtr<mozilla::MetadataHolder>, mozilla::ReadMetadataFailureReason, 1>, mozilla::MediaDecoderReader>::Run() 	obj-firefox/dist/include/mozilla/MozPromise.h

Here are frames 1--3 demangled:

> std::io::stdio::_print::h0bae26b2b3e9ac51
> std::sys_common::unwind::begin_unwind_inner::hf40013815c474ad4
> std::sys_common::unwind::begin_unwind_fmt::h42a3b0f6333a2a21
rillian, can you please investigate?

Also, it's unfortunate that the Rust frames lack three important things: demangled names, line numbers, and links to source code. Do you know why? Are there bugs open to improve that?
> Also, it's unfortunate that the Rust frames lack three important things:
> demangled names, line numbers, and links to source code. Do you know why?

It doesn't even have the source file name.
Ted has also been working on some Oxidation stuff, but I'm not sure if that's more build or crash reporting stuff. But he might have some idea about improved crash reporting for Rust.
I filed bug 1268328 about the inadequate crash reports.
Thanks for the needinfo, dbaron.

The crash report in the description is on windows. There's some upstream discussion of the issue on mac in https://github.com/rust-lang/rust/issues/24346

It...kinda looks like we don't pass -g to rustc. I'll fix that.
acrichto, have you seen panics on Windows in `print` before? From this call stack, it looks like we're panicing in `print`, Rust unrolls up to the boundary, it turns into an exception, and then that is taking us down.
Flags: needinfo?(acrichton)
Oh I thought this was the same as bug 1253202? Which is to say that this is a problem where SSE instructions were in the standard library, which has since been fixed by switching to i586-pc-windows-msvc which has SSE disabled.

If this *isn't* that problem, than I haven't personally seen panics in `print` before, but because the call stack goes directly from `print` to `RaiseException` I'd guess that a panic isn't happening but rather some other fault.
Flags: needinfo?(acrichton)
Hrm, I thought it might be a rust panic because the crash reason is:
Crash Reason 	0x00525354 / 0x00000000

Which is coincidentally the exception code for rust panics:
https://github.com/rust-lang/rust/blob/c66d2380a810c9a2b3dbb4f93a830b101ee49cc2/src/libstd/sys/common/unwind/seh.rs#L64
Aha yep that's a Rust panic! This may just be a "we failed to write to stdout" panic [1], and we may just need to tweak to ignore I/O errors during a panic. For example I suspect this line [2] is firing which is causing a panic-in-a-panic which provides subpar error reporting.


[1]: https://github.com/rust-lang/rust/blob/master/src/libstd/io/stdio.rs#L590-L619
[2]: https://github.com/rust-lang/rust/blob/3f65afa6946ee4a6b1d619aee69e1f638edfb662/src/libstd/io/stdio.rs#L617
AIUI, Windows GUI processes don't really have a notion of stdout unless you compile them in debug mode, so that might provide some insight into what's going on here.
Alex, do you have a bug and can we get started on a fix for this? We're probably going to have to get this fix, get a new build of Rust, and push it out to the servers for the 48 train, as it sounds like we'll hit this crash in release when 48 gets to users otherwise.
Flags: needinfo?(acrichton)
Hm, I'll try to analyze a bit more to see if we can figure out what's going on.

To be clear, this code is *already* panicking, so the process is already "crashing". According to that backtrace the function `mp4parse_read` initiate a panic. The backtrace then claims there's a frame of `_ZN3std2io5stdio6_print17h0bae26b2b3e9ac51E` (demangled as `std::io::stdio::_print::h...`) before `RaiseException` is finally called. The `RaiseException` function is used to implement panics on MSVC in Rust, and it is only called in one location [1] (not `_print`).

When debugging code on MSVC before I've noticed that backtraces tend to be particularly bad (unfortunately). We generate lots of `internal` symbols in the compiler (corresponding to `static` functions in C), which by default aren't resolved by the standard backtrace utilities on MSVC. I think the situation is helped when generating debug information, however.

What ends up happening, however, is that in the object file you have a bunch of functions laid out sequentially. Whenever a function is exported from a DLL, there'll be a bunch of unrelated functions exported after that (just sequentially in memory), and the backtrace will think that all those functions are the exported one. In other words, backtraces are basically incorrect on MSVC from what I've seen due to our aggressive usage of internal symbols.

So my best guess as to what's happening here is:

* The `mp4parse_read` function is panicking
* A Windows exception is getting raised via `RaiseException`
* A crash report is generated at this point
* The backtrace is lying saying that it's in the `_print` function, it's rather just that the function is an internal symbol and if you go lower in memory the first exported function is `_print`

Does that make sense? I don't believe there's something to change in the standard library, but I could be wrong! I do think, however, that debuginfo will probably help quite a bit here :)

[1]: https://github.com/rust-lang/rust/blob/3f65afa6946ee4a6b1d619aee69e1f638edfb662/src/libstd/sys/common/unwind/seh.rs#L88
The `_print` function is used to implement `print!` and `println!` macros, neither of which are called by the implementations of panics [2].

[2]: https://github.com/rust-lang/rust/blob/3f65afa6946ee4a6b1d619aee69e1f638edfb662/src/libstd/panicking.rs#L152
Flags: needinfo?(acrichton)
We've seen 134 of these in the past 7 days across FF 48 and 49.

It's possible that bug 1271529 comment 1 might be relevant here -- it explains malformed stack traces in some other Rust panics.

If we just had line numbers in the stack trace then we'd probably be able to tell where/how mp4parse is panicking, correct?
With a crash like https://crash-stats.mozilla.com/report/index/562dbeeb-728e-4478-9a77-858da2160516 which appears to have line numbers in stack traces, it looks like it may not be the silver bullet here unfortunately. I'm not sure how well PDB/LLVM deal with inlining information, so the line number may be getting lost through inlining.

In the case of inlining losing filename/line number, the next best option would be a better introspection of the panic *message*, which could be done through either:

* catch_unwind - http://doc.rust-lang.org/nightly/std/panic/fn.catch_unwind.html
* set_hook - http://doc.rust-lang.org/nightly/std/panic/fn.set_hook.html

The latter may look like it gives file/line information, but it'll just tell you the same result.rs:687 rather than where the panic came from in your code itself.
OK, so groveling around in the minidump for https://crash-stats.mozilla.com/report/index/fe3a0851-557e-4ef8-be95-4c1782160417, I think what happens is that we're here:

http://dxr.mozilla.org/mozilla-central/source/media/libstagefright/binding/mp4parse/capi.rs#121

and we panic because thread::spawn assumes that thread creation succeeds.  Rust threads take 2MB stacks by default (https://github.com/rust-lang/rust/blob/master/src/libstd/sys/common/util.rs#L18) and we have 100MB of virtual memory free.  We ought to be able to make this allocation succeed, but experience suggests that Windows can have trouble fulfilling large requests like that.  (I thought we took about:memory snapshots, which would be able to tell us the max contiguous block of virtual memory, but maybe I'm thinking of something else or about:memory reporting didn't kick in here, because Rust crashes are different?)

Spawning a thread just to handle mp4 parsing seems heavyweight, but if we're going to do that, can we can least make thread creation fallible by using:

  let task = thread::Builder::new().spawn(...);

and unwrapping |task| safely to get at something which we can then wait on?
Wow, thanks for the analysis, Nathan. I hadn't considered it could be OOM because of the thread stack.

The thread doesn't offer any concurrency benefit here. It was just about protecting users from panics in the my-first-rust-code of the mp4 parser. That's mostly cleaned up, although there are still a few assert!()s, and we're planning to use panic=abort anyway, so the best thing is to remove this thread like we did for mp4parse_get_track_info().
Assignee: nobody → giles
Flags: needinfo?(giles)
Nice! I think that'd definitely do it in terms of causing a panic. I forgot to mention earlier as well, but I'd highly recommend capturing the panic message as well as it should in theory contain the OS error message which would point to OOM. That'd at least provide another vector for tracking down panics in Rust!
We think thread::spawn() may be panicking in low memory
conditions. Test this by using the fallible thread::Builder
API and converting spawn Results into an error return.

Review commit: https://reviewboard.mozilla.org/r/55232/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/55232/
Attachment #8756545 - Flags: review?(kinetik)
Attachment #8756545 - Flags: review?(kinetik) → review+
Comment on attachment 8756545 [details]
MozReview Request: Bug 1266309 - Use fallible rust thread::Builder. r?kinetik

https://reviewboard.mozilla.org/r/55232/#review51922
Great work, Nathan.

(In reply to Alex Crichton [:acrichto] from comment #17)
> Nice! I think that'd definitely do it in terms of causing a panic. I forgot
> to mention earlier as well, but I'd highly recommend capturing the panic
> message as well as it should in theory contain the OS error message which
> would point to OOM. That'd at least provide another vector for tracking down
> panics in Rust!

I filed bug 1275780 for this. froydnj, want to give it a go?
https://hg.mozilla.org/mozilla-central/rev/3b6fb13e56de
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
[Tracking Requested - why for this release]: This fix applies equally well to 48, which is also seeing crashes. I'd like to request uplift once we've confirmed this fixes the issue for 49 nightly.

The crash reports we have are all on win32, so I think the problem is specific to that target. The rust code isn't enabled on windows for 47 or earlier, so they are unaffected.
Looks like this didn't get everything. I see new signatures since this landed last week. e.g. https://crash-stats.mozilla.com/report/index/c29fc408-a68d-4fe5-a9dd-6df502160530
(In reply to Nathan Froyd [:froydnj] from comment #15)
> we panic because thread::spawn assumes that thread creation succeeds. 
> Rust threads take 2MB stacks by default

Are we going to potentially have this problem every place we end up calling thread::spawn? Would it make sense to shrink the default thread stack size?
Flags: needinfo?(giles)
It's probably safer to limit the stack size when we create threads, like we do in some gecko C++ code. Rust provides the stack_size() method on thread::Builder for this.
Flags: needinfo?(giles)
Comment on attachment 8756545 [details]
MozReview Request: Bug 1266309 - Use fallible rust thread::Builder. r?kinetik

In comment #24 I forgot to check the version number. Crashes on 49 with this signature stopped May 25 after the patch landed. Requesting uplift to 48.

Approval Request Comment
[Feature/regressing bug #]: 1161350
[User impact if declined]: Crashes playing video in low memory conditions.
[Describe test coverage new/current, TreeHerder]: Landed on m-c last week.
[Risks and why]: Risk is low. Patch adds a check for thread creation failure and fall back to the C++ mp4 demuxer instead of crashing.
[String/UUID change made/needed]: None.
Attachment #8756545 - Flags: approval-mozilla-aurora?
(In reply to Ralph Giles (:rillian) needinfo me from comment #26)
> It's probably safer to limit the stack size when we create threads, like we
> do in some gecko C++ code. Rust provides the stack_size() method on
> thread::Builder for this.

Great. If we do that then this bug's patch could be reverted, correct? Should I file a bug for limiting the stack size?
Comment on attachment 8756545 [details]
MozReview Request: Bug 1266309 - Use fallible rust thread::Builder. r?kinetik

Fix a crash, taking it.
Attachment #8756545 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
hi, crashes with this signature are still occurring in 48.0b1 unfortunately.
it is currently the #20 crash in 48.0b1 with 0.65% of all crashes, so i'm reopening this bug to make sure this isn't going under the radar.
Status: RESOLVED → REOPENED
Flags: needinfo?(giles)
Keywords: regression
Resolution: FIXED → ---
From a hand inspection of the final 40 or so crashes (is there a better way?), the 48a1 crashes have crashes in both `mp4parse_read` and `mp4parse_get_track_info`, but the 48b1 crashes are only in `mp4parse_get_track_info`. When should the attached fix (which should have fixed `mp4parse_read`) have gone out?

And, if this is just a crash in `mp4parse_get_track_info` for this signature now, is there any possibility that one of the `unwrap()` calls or something else in there is at fault? At a first glance, they all appear to be guarded by `is_some()` checks further up. Or is there another possible source of panics here?
https://dxr.mozilla.org/mozilla-central/source/media/libstagefright/binding/mp4parse/capi.rs#157-199

The only other thing that I can imagine is that we're hitting one of the `assert!` calls here:
https://dxr.mozilla.org/mozilla-central/source/media/libstagefright/binding/mp4parse/lib.rs#1258-1267

(In Rust, `assert!` is also present in release https://doc.rust-lang.org/std/macro.assert!.html, whereas `debug_assert!` is not https://doc.rust-lang.org/std/macro.debug_assert!.html)
The fix on this bug for `mp4parse_read` was uplifted to 48 while that branch was still on aurora, so it should be on all beta builds.

If this is still an oom crash from the thread spawn, and limited to `mp4parse_get_track_info` on 48 then this should be fixed by bug 1275189. I failed to request uplift for that change while 48 was on aurora. I've now done so for beta.

Per the analysis there, the unwrap()s in `mp4parse_get_track_info` are all guarded. We've further audited the lib.rs asserts! today and don't think those can be triggered either.
Flags: needinfo?(giles)
the signature is gone again in 48.0b3 crash data after the fix in bug 1275189 landed there. thank you!
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: