1584956 - Crash in [@ mp4parse_get_indice_table]

Reporter

Description

•

5 years ago

This bug is for crash report bp-60ae907e-2bbe-4f46-b44d-176980190929.

Top 10 frames of crashing thread:

0 XUL gkrust_shared::panic_hook mfbt/Assertions.h:332
1 XUL core::ops::function::Fn::call src/libcore/ops/function.rs:69
2 XUL std::panicking::rust_panic_with_hook src/libstd/panicking.rs:481
3 XUL std::panicking::continue_panic_fmt src/libstd/panicking.rs:384
4 XUL rust_begin_unwind src/libstd/panicking.rs:311
5 XUL core::panicking::panic_fmt src/libcore/panicking.rs:85
6 XUL core::panicking::panic_bounds_check src/libcore/panicking.rs:61
7 XUL mp4parse_get_indice_table src/libcore/slice/mod.rs:2681
8 XUL mozilla::MP4Metadata::GetTrackIndice dom/media/mp4/MP4Metadata.cpp:444
9 XUL mozilla::MP4Demuxer::Init dom/media/mp4/MP4Demuxer.cpp:181

Not a new crash since reports go back all the way to version 63.0. Definitely mac-specific as I couldn't find crashes with this signature on other platforms. The raw crash reason points to an out-of-bounds array access in Rust.

Alex Chronopoulos [:achronop]

Updated

•

5 years ago

Priority: -- → P3

Rob Wu [:robwu]

Comment 1

•

4 years ago

I came across this thread on Reddit and looked up the crash: https://old.reddit.com/r/firefox/comments/iiuofl/why_does_firefox_crash_every_time_i_try_to_open_a/

The frequency of crashes increased significantly 2 months ago, around the time of the 78 release. The top cause of the crash is ESR78:
https://crash-stats.mozilla.org/signature/?product=Firefox&signature=mp4parse_get_indice_table&date=%3E%3D2019-12-09T21%3A53%3A00.000Z&date=%3C2020-09-09T21%3A53%3A00.000Z#aggregations

I skimmed through the reports, and the MozCrashReason seems to always be like
Index out of bounds: the len is [some 5-6 digit number] but the index is 16565899579919558117

The "index" is 0xe5e5e5e5e5e5e5e5, which looks like a jemalloc's poison value.
This suggests that the crash is caused by a use-after-free, through mp4parse_get_indice_table in MP4Metadata::GetTrackIndice.

Bryce, could you check if this bug report is actionable given this extra information?

Flags: needinfo?(bvandyk)

Bryce Seager van Dyk [:bryce] (he/him) - Not reading bugmail

Comment 2

•

4 years ago

•

Edited

Thanks for the analysis!

Recapping:

This crash is largely MacOS specific.
~~Large spike for ESR78, and the majority of crashes are on ESR78.~~ This spike is likely due to us moving old release users to ESR and release having throttling on reporting while ESR doesn't. So the numbers are inflated while the rate may not be increasing.
The MacOS crashes are all (or a vast majority) crashing due to using a value that has been poisoned.

I'm going to make this secure while I investigate. As noted above, touching poison is of concern.

Group: core-security

Bryce Seager van Dyk [:bryce] (he/him) - Not reading bugmail

Comment 3

•

4 years ago

I don't see any crashes yet in 82. It's possible this is due to small sample size, but another consideration is we've been landing fixes to issues found via fuzzing. I do not see anything that would clearly fix this, I will continue investigating.

NI :jbauman, does this look like anything fuzzing has caught?

Flags: needinfo?(jbauman)

Andrew McCreight [:mccr8]

Updated

•

4 years ago

Group: core-security → media-core-security

Andrew McCreight [:mccr8]

Updated

•

4 years ago

Keywords: csectype-uaf, sec-high

Bryce Seager van Dyk [:bryce] (he/him) - Not reading bugmail

Updated

•

4 years ago

Updated

•

4 years ago

Flags: needinfo?(bvandyk)

Jon Bauman [:jbauman:]

Comment 4

•

4 years ago

I've landed a number of fuzzing fixes in mp4parse-rust, but nothing in response to a crash quite like this. Here are the most relevant changes:

However, if it's a use after free with mp4parse_get_indice_table, I'd be inclined to think it's more likely an issue in the C++ code calling mp4parse_free and then mp4parse_get_indice_table, or possibly calling mp4parse_get_indice_table when mp4parse_new wasn't successful. The only place mp4parse_free is referenced is here, so I'll poke around a bit and see if I see anything obvious.

Jon Bauman [:jbauman:]

Comment 5

•

4 years ago

Based on the stack, we must be here in MP4Demuxer::Init. We can assume that mp4parse_new was successful, because if it wasn't we'd have returned an "Parse MP4 metadata failed" error earlier in Init. Furthermore, the only way mp4parse_free gets called is via the FreeMP4Parser implementation which is a destructor template argument to UniquePtr (which I assume is correct). Given where the metadata local variable (which contains the parser instance) is declared and lacking any calls that would destruct it before the call to metadata.GetTrackIndice(), I'm having a hard time seeing how there would be a use-after-free here.

Does that analysis seem right to you, Bryce? Do you see another way this could be happening?

Flags: needinfo?(jbauman)

Gabriele Svelto [:gsvelto]

Reporter

Comment 6

•

4 years ago

FYI I poked a few minidumps and found something odd but maybe worthy of note: the stack walker often resorts to stack scanning after the mozilla::MP4Metadata::GetTrackIndice(unsigned int) frame. That's usually a sign that the stack has been corrupted as we can't find the frame-pointer for the previous call. I found the exact same thing in bug 1654335 so there's a chance both issues may have a common (or at least similar) cause.

Jon Bauman [:jbauman:]

Comment 7

•

4 years ago

Corrupt stack or not, the clues point pretty strongly at a use-after-free, but I'm having a hard time seeing how that would occur with the code the way it is.

What do I need to do to get access to view bug 1654335?

Steven Michaud [:smichaud] (Retired)

Comment 8

•

4 years ago

What do I need to do to get access to view bug 1654335?

I just CCed you on the bug.

Bryce Seager van Dyk [:bryce] (he/him) - Not reading bugmail

Updated

•

4 years ago

Comment 9

•

4 years ago

•

Edited

I've learnt that crash reporting in release and ESR works differently and that release only reports 10% of crashes, while ESR reports them all. Since we migrated a large number of users from old MacOS release builds to ESR, the uptick we see in crash stats is likely inflated. I've updated comment 2 to reflect this.

After looking I've added what appears to be a related sig that is catching Windows crashes. Some of the crashes there look like the MacOS ones with us panicing due to bad lengths, but there appear to be other issues, so I'm not sure they're all the same root cause. I'm focusing on looking at the ones that seem like bad indexing for now.

If MacOS dumps can be debugged locally, I don't know how, but I can work with a Windows dump. Stack of a crash from https://crash-stats.mozilla.org/report/index/d3fc0715-27b9-463c-9d86-4e7100200908 (assuming correct stack walking + symbolization):

0:051> k
  *** Stack trace for last set context - .thread/.cxr resets it
 # Child-SP          RetAddr               Call Site
00 (Inline Function) --------`--------     xul!MOZ_Crash+0xa [/builds/worker/workspace/obj-build/dist/include/mozilla/Assertions.h @ 332] 
01 00000075`dc43e7e0 00000271`261369a5     xul!RustMozCrash+0x10 [/builds/worker/checkouts/gecko/mozglue/static/rust/wrappers.cpp @ 16] 
02 00000075`dc43e810 00000271`2613692c     xul!mozglue_static::panic_hook+0x75 [/builds/worker/checkouts/gecko/mozglue/static/rust/lib.rs @ 89] 
03 (Inline Function) --------`--------     xul!core::ops::function::FnOnce::call_once+0x5 [/rustc/4fb7144ed159f94491249e86d5bbd033b5d60550\src\libcore\ops\function.rs @ 232] 
04 00000075`dc43ec60 00000271`2628dcb1     xul!core::ops::function::FnOnce::call_once<fn(core::panic::PanicInfo*),(core::panic::PanicInfo*)>+0xc [/rustc/4fb7144ed159f94491249e86d5bbd033b5d60550\src\libcore\ops\function.rs @ 72] 
05 00000075`dc43ec90 00000271`26293b2a     xul!std::panicking::rust_panic_with_hook+0x1a1 [/rustc/4fb7144ed159f94491249e86d5bbd033b5d60550\/src\libstd\panicking.rs @ 477] 
06 00000075`dc43ed90 00000271`25f2df10     xul!std::panicking::begin_panic_handler+0x4a [/rustc/4fb7144ed159f94491249e86d5bbd033b5d60550\/src\libstd\panicking.rs @ 378] 
07 00000075`dc43ede0 00000271`25f2ddaa     xul!std::panicking::begin_panic_fmt+0x30 [/rustc/4fb7144ed159f94491249e86d5bbd033b5d60550\/src\libcore\panicking.rs @ 85] 
08 00000075`dc43ee30 00000271`26158f60     xul!core::panicking::panic_bounds_check+0x7a [/rustc/4fb7144ed159f94491249e86d5bbd033b5d60550\/src\libcore\panicking.rs @ 63] 
09 (Inline Function) --------`--------     xul!mp4parse_capi::create_sample_table+0x1a2e [/builds/worker/checkouts/gecko/third_party/rust/mp4parse_capi/src/lib.rs @ 1450] 
0a (Inline Function) --------`--------     xul!mp4parse_capi::get_indice_table+0x1b98 [/builds/worker/checkouts/gecko/third_party/rust/mp4parse_capi/src/lib.rs @ 1139] 
0b 00000075`dc43eec0 00000271`246e51c6     xul!mp4parse_capi::mp4parse_get_indice_table+0x1bd0 [/builds/worker/checkouts/gecko/third_party/rust/mp4parse_capi/src/lib.rs @ 1096] 
0c 00000075`dc43f060 00000271`246e32c2     xul!mozilla::MP4Metadata::GetTrackIndice+0x76 [/builds/worker/checkouts/gecko/dom/media/mp4/MP4Metadata.cpp @ 446] 
0d 00000075`dc43f150 00000271`244da32f     xul!mozilla::MP4Demuxer::Init+0xb52 [/builds/worker/checkouts/gecko/dom/media/mp4/MP4Demuxer.cpp @ 181] 
0e (Inline Function) --------`--------     xul!mozilla::MediaFormatReader::DemuxerProxy::Init::<unnamed-tag>::operator()+0x62 [/builds/worker/checkouts/gecko/dom/media/MediaFormatReader.cpp @ 728] 
0f 00000075`dc43f3b0 00000271`22901390     xul!mozilla::detail::ProxyFunctionRunnable<`lambda at /builds/worker/checkouts/gecko/dom/media/MediaFormatReader.cpp:723:22',mozilla::MozPromise<mozilla::MediaResult,mozilla::MediaResult,1> >::Run+0x7f [/builds/worker/workspace/obj-build/dist/include/mozilla/MozPromise.h @ 1565] 
10 00000075`dc43f410 00000271`2290c846     xul!mozilla::TaskQueue::Runner::Run+0x180 [/builds/worker/checkouts/gecko/xpcom/threads/TaskQueue.cpp @ 165] 
11 00000075`dc43f4b0 00000271`21fd3dde     xul!nsThreadPool::Run+0x686 [/builds/worker/checkouts/gecko/xpcom/threads/nsThreadPool.cpp @ 301] 
12 00000075`dc43f650 00000271`21fd2c21     xul!nsThread::ProcessNextEvent+0xcae [/builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp @ 1239] 
13 (Inline Function) --------`--------     xul!NS_ProcessNextEvent+0x20 [/builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp @ 513] 
14 00000075`dc43f840 00000271`22dbdc78     xul!mozilla::ipc::MessagePumpForNonMainThreads::Run+0x181 [/builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp @ 332] 
15 (Inline Function) --------`--------     xul!MessageLoop::RunInternal+0xf [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc @ 334] 
16 00000075`dc43f8c0 00000271`21fab871     xul!MessageLoop::RunHandler+0x28 [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc @ 328] 
17 00000075`dc43f910 00000271`2290785d     xul!MessageLoop::Run+0x51 [/builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc @ 310] 
18 00000075`dc43f960 00007ff9`065af854     xul!nsThread::ThreadFunc+0x12d [/builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp @ 449] 
19 00000075`dc43fb20 00007ff9`0659fdba     nss3!_PR_NativeRunThread+0x144 [/builds/worker/checkouts/gecko/nsprpub/pr/src/threads/combined/pruthr.c @ 421] 
1a 00000075`dc43fba0 00007ff9`3e8a1542     nss3!pr_root+0xa [/builds/worker/checkouts/gecko/nsprpub/pr/src/md/windows/w95thred.c @ 140] 
1b 00000075`dc43fbd0 00000271`5ec3b9f0     ucrtbase+0x21542
1c 00000075`dc43fbd8 00000000`00000000     0x00000271`5ec3b9f0

i.e. something is going wrong with us indexing into the sample table[0]. It makes sense peek_index is our bad index and that something is going wrong earlier to cause this. This suggests sort_table contains invalid data. Could something be going wrong with the TryVec such that we end up with it being freed and then poisoned or full of garbage that leads to bad indexing?

My current two areas I plan to look more at:

Our fallible allocation. We've already switched out what we're doing upstream, I wonder if there's any other dangers in the old code that could lead to this.
The data we're getting from the mp4 and how it's being used here. We're using some data here to set sizes of containers. Is there anywhere that data could cause problems by being malformed/malicious (overflows, etc)?

[0] https://searchfox.org/mozilla-central/rev/eb9d5c97927aea75f0c8e38bbc5b5d288099e687/third_party/rust/mp4parse_capi/src/lib.rs#1611

Bryce Seager van Dyk [:bryce] (he/him) - Not reading bugmail

Updated

•

4 years ago

Crash Signature: [@ mp4parse_get_indice_table] → [@ mp4parse_get_indice_table] [@ mp4parse_capi::mp4parse_get_indice_table ]

Jon Bauman [:jbauman:]

Comment 10

•

4 years ago

Could something be going wrong with the TryVec such that we end up with it being freed and then poisoned or full of garbage that leads to bad indexing?

It's certainly possible, since there is some unsafe code involved. Nearly all of the unsafe code involved is in this one function if you'd like to take a look. Though this crash is happening with much older code, so I don't think that's likely relevant at all. Based on the last update of mp4parse-rust code before this bug was filed, create_sample_table would've been using a stdlib Vec with vec_push for fallible allocation coming from the 0.0.1 version of our mp4parse_fallible crate. In that case, the code to check is here.

It does seem like it would be very instructive to know whether we still see this crash on code after the switch to broader use of fallible allocation or after we switched from our home-grown implementation based on malloc/realloc to the fallible_collections implementation based on rust stdlib alloc API.

This update on May 9 marks the landing of the former, and should show up in 78
This update on Sep 2 marks the landing of the latter, and should show up in 82

Steven Michaud [:smichaud] (Retired)

Comment 11

•

4 years ago

•

Edited

It does seem like it would be very instructive to know whether we still see this crash on code after the switch to broader use of fallible allocation or after we switched from our home-grown implementation based on malloc/realloc to the fallible_collections implementation based on rust stdlib alloc API.

This update on May 9 marks the landing of the former, and should show up in 78

Apparently we do for at least this one update:

https://crash-stats.mozilla.org/search/?proto_signature=~mp4parse_get_indice_table&date=%3E%3D2020-08-11T20%3A39%3A00.000Z&date=%3C2020-09-11T20%3A39%3A00.000Z&_facets=signature&_facets=version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version

This is still true even if we filter out all 0x0 crash addresses:

https://crash-stats.mozilla.org/search/?proto_signature=~mp4parse_get_indice_table&address=%210x0&date=%3E%3D2020-08-11T20%3A39%3A00.000Z&date=%3C2020-09-11T20%3A39%3A00.000Z&_facets=signature&_facets=version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version

Steven Michaud [:smichaud] (Retired)

Comment 12

•

4 years ago

This is still true even if we filter out all 0x0 crash addresses:

But note that crash addresses > 128TB are changed to 0x0 on the Mac and Linux (and to 0xffffffffffffffff on Windows). This includes addresses that have been poisoned.

Bug 974420 comment 18.

Bryce Seager van Dyk [:bryce] (he/him) - Not reading bugmail

Comment 13

•

4 years ago

Assigning to me so this has an owner. I don't think I can action this at this stage, but will keep eyes on it.

Assignee: nobody → bvandyk

Priority: P3 → P2

Gabriele Svelto [:gsvelto]

Reporter

Updated

•

4 years ago

Comment 14

•

4 years ago

(In reply to Bryce Seager van Dyk (:bryce) from comment #13)

Assigning to me so this has an owner. I don't think I can action this at this stage, but will keep eyes on it.

Hey Bryce, Based on your comment #13, this bug appears stalled to me. (So I'm marking it as such.) I'm needinfo'ing you for a "sanity-check" that marking it stalled makes sense to you. Thanks!

Flags: needinfo?(bvandyk)

Keywords: stalled

Bryce Seager van Dyk [:bryce] (he/him) - Not reading bugmail

Comment 15

•

4 years ago

(In reply to Maire Reavy [:mreavy] from comment #14)

(In reply to Bryce Seager van Dyk (:bryce) from comment #13)

Assigning to me so this has an owner. I don't think I can action this at this stage, but will keep eyes on it.

Hey Bryce, Based on your comment #13, this bug appears stalled to me. (So I'm marking it as such.) I'm needinfo'ing you for a "sanity-check" that marking it stalled makes sense to you. Thanks!

I think that's a fair assessment. Until we figure out the root cause of bug 1665411 I don't think I can action this.

Flags: needinfo?(bvandyk)

Gabriele Svelto [:gsvelto]

Reporter

Updated

•

4 years ago

Depends on: 1676343

BugBot [:suhaib / :marco/ :calixte]

Comment 16

•

2 years ago

The bug assignee didn't login in Bugzilla in the last months and this bug has priority 'P2'/severity 'critical'.
:jimm, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: brycebugemail → nobody

Flags: needinfo?(jmathies)

Jim Mathies [:jimm]

Comment 17

•

2 years ago

https://crash-stats.mozilla.org/signature/?signature=mp4parse_get_indice_table

still present, but only on very old versions of osx. pretty high volume, so it'd be great to get it fixed. stalled at this point though until we find an owner who has the cycles.

Severity: critical → S4

Flags: needinfo?(jmathies)

Gabriele Svelto [:gsvelto]

Reporter

Comment 18

•

2 years ago

The volume is artificially amplified by the crashes happening on the ESR channel which isn't throttled. We're seeing all crashes and that's ~200 users per week which isn't too bad.

BugBot [:suhaib / :marco/ :calixte]

Comment 19

•

2 years ago

The severity field for this bug is set to S4. However, the bug is flagged with the sec-high keyword.
:jimm, could you consider increasing the severity of this security bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmathies)

Jim Mathies [:jimm]

Updated

•

2 years ago

Flags: needinfo?(jmathies)

Ryan VanderMeulen [:RyanVM]

Comment 20

•

2 years ago

No recent reports from any modern versions.

Status: NEW → RESOLVED

Closed: 2 years ago

Resolution: --- → WORKSFORME

BugBot (nomail) [:suhaib / :marco/ :calixte]

Comment 21

•

2 years ago

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.

Keywords: stalled

Daniel Veditz [:dveditz]

Updated

•

1 year ago

Group: media-core-security