Closed Bug 1447133 Opened 7 years ago Closed 1 year ago

Raft of wildptr failures in many opus functions

Categories

(Core :: Audio/Video: Playback, defect, P2)

58 Branch
defect

Tracking


RESOLVED INCOMPLETE

People

(Reporter: jesup, Unassigned)

References

Details

(Keywords: crash, csectype-wildptr, sec-high)

Crash Data

This search: https://crash-stats.mozilla.com/search/?proto_signature=~opus&product=Firefox&version=61.0a1&version=60.0a1&version=60.0b&version=59.0.1&version=59.0b&version=59.0&version=58.0.2&version=58.0.1&version=58.0b&version=58.0&date=%3E%3D2017-12-15T04%3A29%3A24.000Z&date=%3C2018-03-15T12%3A29%3A24.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature has me worried; there is a raft of wildptr crashes (including EXEC and illegal-instruction crashes, but no e5e5 UAF crashes) scattered across a wide range of opus decoding functions. The lack of a clear common source or crash location makes me think this is due to something happening in the framework calling Opus - freeing buffers opus is using, or something else. Surveying a smattering of them on the same version (for example, quant_all_bands for 58.0.2), I don't even see duplicated crash line numbers. This implies something async pulling the rug out from under the decoder - freeing the buffers it's using? Trashing the stack? Something weird with Promises? I don't think it's too likely to be an Opus bug, but that is possible. We have EXEC and illegal-instruction crashes here, plus a lot of WRITE errors -> sec-critical for now.
Crash Signature: [@ quant_all_bands] [@ celt_decode_with_ec] [@ clt_mdct_backward_c] [@ comb_filter] [@ kf_bfly5] [@ compute_allocation] [@ opus_decode_frame]
That signature list is just the highest-frequency signatures from the search.
The Opus decoder is synchronous. It never creates a thread or dispatches a task. So promises used with it are just a wrapper: everything is done on the calling thread immediately and completed by the time the promise is returned. So the "something async pulling the rug out" isn't a plausible scenario.
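A minimal sketch of that synchronous usage, using the public Opus API (the buffer sizes and the packet-loss-concealment call are illustrative assumptions for this sketch, not Gecko's actual wrapper code):

#include <opus.h>
#include <stdio.h>

int main(void)
{
    int err = 0;
    OpusDecoder *dec = opus_decoder_create(48000, 2, &err);
    if (err != OPUS_OK || dec == NULL)
        return 1;

    /* In the real pipeline the demuxer supplies the packet; passing
     * data=NULL/len=0 asks the decoder for packet-loss concealment, which
     * keeps this sketch self-contained. */
    opus_int16 pcm[960 * 2];  /* 20 ms of stereo at 48 kHz */
    int samples = opus_decode(dec, NULL, 0, pcm, 960, 0);

    /* By the time opus_decode() returns, the PCM (or an error code) is
     * already there; nothing runs asynchronously afterwards. */
    printf("decoded %d samples per channel\n", samples);

    opus_decoder_destroy(dec);
    return 0;
}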
It is a pity that the crash reports got reset. Can't tell when this started.
It's not impossible that bit-flip errors could cause these - but I think that's pretty unlikely given what we see and the frequency. I'm looking at a few in more detail.
To me, many of the crashes are consistent with the "end" field in the CELT state being corrupted. That would cause random junk to be interpreted as array lengths for the rest of the decoder.
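A sketch of the kind of cheap state sanity check that would catch this (the struct and field names are illustrative assumptions standing in for the real CELT decoder state, not the actual Opus code):

typedef struct {
   int start;      /* first coded band */
   int end;        /* one past the last coded band */
   int nbEBands;   /* number of bands defined by the mode */
} celt_state_view;

/* A corrupted "end" would later be reused as an array length throughout the
 * decoder, so bound it against the mode's band count before decoding. */
static int celt_state_looks_sane(const celt_state_view *st)
{
   if (st->end <= 0 || st->end > st->nbEBands)
      return 0;
   if (st->start < 0 || st->start >= st->end)
      return 0;
   return 1;
}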
(In reply to Jean-Marc Valin (:jmspeex) from comment #5)
> To me, many of the crashes are consistent with the "end" field in the CELT
> state being corrupted. That would cause random junk to be interpreted as
> array lengths for the rest of the decoder.

Where in the code and structures is this? Are there any good ways to check this for sanity without excessive runtime cost? (And to verify whether this is happening, either by a forced assert/MOZ_RELEASE_ASSERT or by failing and logging to telemetry.)
Flags: needinfo?(jmvalin)
I'm currently looking at making a significant fraction of the Opus assertions unconditional. Basically, I'm dividing the assertions into two groups:
1) The ones that are cheap and that are "provably true" (and for which failure is scary)
2) The ones that are either too expensive or for which failure would just result in bad quality.

Assertions in category 1) could be enabled with a special --enable-hardening option that has a very small cost and is highly unlikely to cause a crash just because some complicated math didn't behave like it should have.

In addition, I'm still considering adding a few checks on the state upon entering the decoder. I was at some point considering a full CRC, but that would be too costly. OTOH, checking that a few fields are sane would be cheap and effective at catching many issues.
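For illustration, a sketch of what such a two-tier assertion scheme could look like (the macro names are assumptions for this sketch, not the exact Opus macros):

#include <stdio.h>
#include <stdlib.h>

/* Tier 1: cheap, "provably true" checks; kept on when hardening is enabled,
 * even in release builds. */
#if defined(ENABLE_HARDENING) || !defined(NDEBUG)
#  define HARD_ASSERT(cond) \
      do { if (!(cond)) { fprintf(stderr, "fatal: %s\n", #cond); abort(); } } while (0)
#else
#  define HARD_ASSERT(cond) ((void)0)
#endif

/* Tier 2: expensive checks, or ones whose failure only means bad quality;
 * debug builds only. */
#ifndef NDEBUG
#  define SOFT_ASSERT(cond) HARD_ASSERT(cond)
#else
#  define SOFT_ASSERT(cond) ((void)0)
#endif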
Flags: needinfo?(jmvalin)
(In reply to Jean-Marc Valin (:jmspeex) from comment #7)
> I'm currently looking at making a significant fraction of the Opus
> assertions unconditional. Basically, I'm dividing the assertions into two
> groups:
> 1) The ones that are cheap and that are "provably true" (and for which
> failure is scary)
> 2) The ones that are either too expensive or for which failure would just
> result in bad quality.
>
> Assertions in category 1) could be enabled with a special --enable-hardening
> option that has a very small cost and is highly unlikely to cause a crash
> just because some complicated math didn't behave like it should have.
>
> In addition, I'm still considering adding a few checks on the state upon
> entering the decoder. I was at some point considering a full CRC, but that
> would be too costly. OTOH, checking that a few fields are sane would be
> cheap and effective at catching many issues.

Is this done or still ongoing? Which bugs are tracking this work? Can we assign this one here to you then? ;)
Flags: needinfo?(jmvalin)
I landed about a dozen commits in the Opus master branch that enable most assertions when compiling with --enable-hardening (or defining ENABLE_HARDENING). There's also a bunch of extra sanity checks on the decoder state which should catch most/all state corruption errors.

I also have another patch, not yet landed in master, that attempts to check that the pointers passed to the decoder do not alias each other and cause corruption during the decoding process (e.g. if the pcm buffer overlaps the state, then the decoder could overwrite its own state by writing the output audio). That patch is at: https://jmvalin.ca/misc_stuff/aliasing_check3.patch

I wouldn't pull this in for a release, but I think it's safe enough for nightly/beta.
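The gist of such an aliasing check, as a sketch (the helper name and the hypothetical usage below are assumptions, not the contents of the linked patch):

#include <stddef.h>
#include <stdint.h>

/* Returns non-zero if the two byte ranges intersect. */
static int ranges_overlap(const void *a, size_t a_len, const void *b, size_t b_len)
{
   uintptr_t a0 = (uintptr_t)a, a1 = a0 + a_len;
   uintptr_t b0 = (uintptr_t)b, b1 = b0 + b_len;
   return a0 < b1 && b0 < a1;   /* half-open interval intersection */
}

/* Hypothetical use at the top of the decode call: refuse to decode if the
 * caller's pcm output buffer overlaps the decoder state, since decoding
 * would then overwrite the state with output audio.
 *
 *   if (ranges_overlap(st, sizeof(*st), pcm,
 *                      (size_t)frame_size * channels * sizeof(*pcm)))
 *      return OPUS_BAD_ARG;
 */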
Flags: needinfo?(jmvalin)
Yes, this sounds super interesting. I think this could be very useful for our fuzzers. CCing decoder.
I have a copy of Jean-Marc's patches that apply to inbound (without importing a new version of Opus). They pass Try. https://treeherder.mozilla.org/#/jobs?repo=try&revision=23bd98a4b4e93c437b06390001514480e232f74d (oranges there appear to be unrelated to this patchset)
(In reply to Randell Jesup [:jesup] from comment #11)
> I have a copy of Jean-Marc's patches that apply to inbound (without
> importing a new version of Opus). They pass Try.
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=23bd98a4b4e93c437b06390001514480e232f74d
> (oranges there appear to be unrelated to this patchset)

The patch appears to create quite a few oranges. It is also not clear to me how we would import a patch into nightly and beta only. Or is the patch only intended to help with fuzzing?
Assignee: nobody → jmvalin
Rank: 5
Flags: needinfo?(jmvalin)
Priority: -- → P1
I'm not sure what info you'd like from me.
Flags: needinfo?(jmvalin)
Nils, could you clarify your question for Jean-Marc?
Flags: needinfo?(drno)
Tyson, are you by any chance already fuzzing Opus, and could you integrate Jean-Marc's patch from comment #9 to see if that results in any reproducible issues?
Flags: needinfo?(drno) → needinfo?(twsmith)
Jean-Marc, how far out is an Opus 1.3 release? Or do you recommend updating to the latest HEAD now instead?
Flags: needinfo?(jmvalin)
I'm still doing stabilization towards a final 1.3 release. That being said, there's nothing in the current master that's a known regression compared to 1.2.1. So I'd say it's safe for beta, but it might be worth doing a bit more testing before putting it in a full release.
Flags: needinfo?(jmvalin)
(In reply to Nils Ohlmeier [:drno] from comment #15)
> Tyson, are you by any chance already fuzzing Opus, and could you integrate
> Jean-Marc's patch from comment #9 to see if that results in any reproducible
> issues?

I'm not actively fuzzing it since it is in OSS-Fuzz. But I'll have a look. I'll try to have something running over the weekend.
Flags: needinfo?(twsmith)
(In reply to Tyson Smith [:tsmith] from comment #18)
> I'll try to have something running over the weekend.

I fuzzed opus with the attached patch applied and it did not turn up any issues.
Thanks Tyson. Then I would suggest we upgrade libopus to current master and check if the new asserts give us more information. I suggest we then use ENABLE_HARDENING only in Nightly and Beta and see what that gets us.
Depends on: 1471148
Is there more followup or investigation we can do here?
Flags: needinfo?(drno)
Flags: needinfo?(drno)
Dan, can I ask you to do an update of libopus from upstream in a separate non-sec bug and enable the hardening as described in comment #20?
Flags: needinfo?(dminor)
Flags: needinfo?(dminor)
Did hardening find anything BTW?
(In reply to Jean-Marc Valin (:jmspeex) from comment #24)
> Did hardening find anything BTW?

We need to ship it first; none of these have known steps-to-reproduce to let us test.
There are still crashes showing up with these signatures in 64 beta, though we shipped an updated version of libopus in 64 in bug 1487049.
Are these assert failures now? Any link to 64 crashes?
I'm seeing crazy things like an invalid write on the line "if(ftb>EC_UINT_BITS){" and an invalid read on "total_bits = len*8;" where one is a local variable and the other is an argument to the function. An EXEC violation calling a static C function in the same file as the caller. Maybe it's memory damage, but in at least one case I saw the same install crash with the same nonsense stack half a dozen times in a row. Not sure what to call this, but it doesn't look like a "sec-critical" we can make progress on.

Ted: is there anything we can do here short of something like bug 639684 or bug 634726, and is there any plan to actually do something along those lines?
Flags: needinfo?(ted)
Keywords: sec-critical
It would really help if I could have access to some stack data, things like the values of the arguments passed through the stack trace. Otherwise, I'm going in blind and there's indeed little chance I'd find anything. My bet is still on "outside" code corrupting something in Opus, but it'd be good to confirm either way.
(In reply to Jean-Marc Valin (:jmspeex) from comment #30)
> It would really help if I could have access to some stack data, things like
> the values of the arguments passed through the stack trace. Otherwise, I'm
> going in blind and there's indeed little chance I'd find anything. My bet is
> still on "outside" code corrupting something in Opus, but it'd be good to
> confirm either way.

You should request raw dump access for Socorro; that will let you download minidump files, which will give you access to stack memory. For Windows dumps you'll be able to open them in a debugger.
Flags: needinfo?(ted)
(In reply to Daniel Veditz [:dveditz] from comment #29)
> Ted: is there anything we can do here short of something like bug 639684 or
> bug 634726, and is there any plan to actually do something along those lines?

I don't know of anyone actively working on those things, but heycam does have a proof-of-concept tool to check whether instructions in the minidump match the binaries they came from; bug 1274628 has some info on that. If these crashes are due to instructions being corrupted, that could help.
Keywords: stalled

Removing employee no longer with company from CC list of private bugs.

I'm no longer with Mozilla. OTOH, I'm still the Opus maintainer.

The severity field for this bug is set to normal. However, the bug is flagged with the sec-high keyword.
:jmvalin, could you consider increasing the severity of this security bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmvalin)

Is anyone actively pursuing this?

Flags: needinfo?(jmvalin)
Depends on: 1762614
Blocks: 1762642

I've filed bugs to update opus and make it auto-update; but I can't speak to the investigation of the crashes.

Crash Signature: [@ quant_all_bands] [@ celt_decode_with_ec] [@ clt_mdct_backward_c] [@ comb_filter] [@ kf_bfly5] [@ compute_allocation] [@ opus_decode_frame] → [@ quant_all_bands] [@ celt_decode_with_ec] [@ clt_mdct_backward_c] [@ comb_filter] [@ kf_bfly5] [@ compute_allocation] [@ opus_decode_frame] [@ opus_encode_native ] [@ opus_fft_impl ] [@ abort | celt_fatal ]

The bug assignee is inactive on Bugzilla, and this bug has priority 'P1'.
:jimm, could you have a look please?

For more information, please visit auto_nag documentation.

Assignee: jmvalin → nobody
Flags: needinfo?(jmathies)
Severity: normal → S3

(In reply to Tom Ritter [:tjr] from comment #37)

> I've filed bugs to update opus and make it auto-update; but I can't speak to the investigation of the crashes.

Hey Tom, curious, how can I check for recent updates via updatebot? Is there some sort of log or dashboard someplace?

We have updatebot configured for libopus -

https://searchfox.org/mozilla-central/source/media/libopus/moz.yaml

If I search Bugzilla for update-related bugs, though, I can't find any recent pulls / posts / attempts since you updated it back in March -

https://bugzilla.mozilla.org/buglist.cgi?query_format=advanced&list_id=16248901&short_desc_type=allwordssubstr&short_desc=Update%20libopus&classification=Client%20Software&classification=Developer%20Infrastructure&classification=Components&classification=Server%20Software&classification=Other&resolution=FIXED&resolution=INVALID&resolution=WONTFIX&resolution=INACTIVE&resolution=DUPLICATE&resolution=WORKSFORME&resolution=INCOMPLETE&resolution=SUPPORT&resolution=EXPIRED&resolution=MOVED

Seems like updatebot should have pulled recent changes in? (Looking at the repo, there have been updates since the last time we pulled the lib in.)

As for this bug, I think I'll break the various signatures that still show up into separate sec bugs for tracking.

Reviewing stalled bugs: there are significantly fewer crashes on record, so I'm going to close this as incomplete; if a pattern emerges, we can open a new bug to dig into it.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → INCOMPLETE

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled
Group: media-core-security