Closed Bug 1447133 Opened 7 years ago Closed 1 year ago

Raft of wildptr failures in many opus functions

Categories

(Core :: Audio/Video: Playback, defect, P2)

58 Branch
defect

Tracking


RESOLVED INCOMPLETE

People

(Reporter: jesup, Unassigned)

References

Details

(Keywords: crash, csectype-wildptr, sec-high)

Crash Data

This search: https://crash-stats.mozilla.com/search/?proto_signature=~opus&product=Firefox&version=61.0a1&version=60.0a1&version=60.0b&version=59.0.1&version=59.0b&version=59.0&version=58.0.2&version=58.0.1&version=58.0b&version=58.0&date=%3E%3D2017-12-15T04%3A29%3A24.000Z&date=%3C2018-03-15T12%3A29%3A24.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature has me worried; there is a raft of wildptr crashes (including EXEC and illegal-instruction crashes, but no e5e5 UAF crashes) scattered across a wide range of opus decoding functions. The lack of a clear common source or crash location makes me think this is due to something happening in the framework calling Opus - freeing buffers opus is using, or something else. Surveying a smattering of them on the same version (for example, quant_all_bands for 58.0.2), I don't even see duplicated crash line numbers. This implies something async pulling the rug out from under the decoder - freeing the buffers it's using? Trashing the stack? Something weird with Promises? I don't think it's too likely to be an Opus bug, but that is possible. We have EXEC and illegal-instruction crashes here, plus a lot of WRITE errors -> sec-critical for now.
Crash Signature: [@ quant_all_bands] [@ celt_decode_with_ec] [@ clt_mdct_backward_c] [@ comb_filter] [@ kf_bfly5] [@ compute_allocation] [@ opus_decode_frame]
That signature list is just the highest-frequency signatures from the search.
The Opus decoder is synchronous. It never creates a thread or dispatches a task. So promises used with it are just a wrapper: everything is done on the calling thread immediately and completed by the time the promise is returned. So the "something async pulling the rug out" isn't a plausible scenario.
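A minimal sketch of that synchronous usage, using the public Opus API (the buffer sizes and the packet-loss-concealment call are illustrative assumptions for this sketch, not Gecko's actual wrapper code):

#include <opus.h>
#include <stdio.h>

int main(void)
{
    int err = 0;
    OpusDecoder *dec = opus_decoder_create(48000, 2, &err);
    if (err != OPUS_OK || dec == NULL)
        return 1;

    /* In the real pipeline the demuxer supplies the packet; passing
     * data=NULL/len=0 asks the decoder for packet-loss concealment, which
     * keeps this sketch self-contained. */
    opus_int16 pcm[960 * 2];  /* 20 ms of stereo at 48 kHz */
    int samples = opus_decode(dec, NULL, 0, pcm, 960, 0);

    /* By the time opus_decode() returns, the PCM (or an error code) is
     * already there; nothing runs asynchronously afterwards. */
    printf("decoded %d samples per channel\n", samples);

    opus_decoder_destroy(dec);
    return 0;
}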
It is a pity that the crash reports got reset. Can't tell when this started.
It's not impossible that bit-flip errors could cause these - but I think that's pretty unlikely given what we see and the frequency. I'm looking at a few in more detail.
To me, many of the crashes are consistent with the "end" field in the CELT state being corrupted. That would cause random junk to be interpreted as array lengths for the rest of the decoder.
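A sketch of the kind of cheap state sanity check that would catch this (the struct and field names are illustrative assumptions standing in for the real CELT decoder state, not the actual Opus code):

typedef struct {
   int start;      /* first coded band */
   int end;        /* one past the last coded band */
   int nbEBands;   /* number of bands defined by the mode */
} celt_state_view;

/* A corrupted "end" would later be reused as an array length throughout the
 * decoder, so bound it against the mode's band count before decoding. */
static int celt_state_looks_sane(const celt_state_view *st)
{
   if (st->end <= 0 || st->end > st->nbEBands)
      return 0;
   if (st->start < 0 || st->start >= st->end)
      return 0;
   return 1;
}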
(In reply to Jean-Marc Valin (:jmspeex) from comment #5)
> To me, many of the crashes are consistent with the "end" field in the CELT
> state being corrupted. That would cause random junk to be interpreted as
> array lengths for the rest of the decoder.

Where in the code and structures is this? Are there any good ways to check this for sanity without excessive runtime cost? (And to verify whether this is happening, either by a forced assert/MOZ_RELEASE_ASSERT or by failing and logging to telemetry.)
Flags: needinfo?(jmvalin)
I'm currently looking at making a significant fraction of the Opus assertions unconditional. Basically, I'm dividing the assertions into two groups:
1) The ones that are cheap and that are "provably true" (and for which failure is scary)
2) The ones that are either too expensive or for which failure would just result in bad quality.

Assertions in category 1) could be enabled with a special --enable-hardening option that has a very small cost and is highly unlikely to cause a crash just because some complicated math didn't behave like it should have.

In addition, I'm still considering adding a few checks on the state upon entering the decoder. I was at some point considering a full CRC, but that would be too costly. OTOH, checking that a few fields are sane would be cheap and effective at catching many issues.
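For illustration, a sketch of what such a two-tier assertion scheme could look like (the macro names are assumptions for this sketch, not the exact Opus macros):

#include <stdio.h>
#include <stdlib.h>

/* Tier 1: cheap, "provably true" checks; kept on when hardening is enabled,
 * even in release builds. */
#if defined(ENABLE_HARDENING) || !defined(NDEBUG)
#  define HARD_ASSERT(cond) \
      do { if (!(cond)) { fprintf(stderr, "fatal: %s\n", #cond); abort(); } } while (0)
#else
#  define HARD_ASSERT(cond) ((void)0)
#endif

/* Tier 2: expensive checks, or ones whose failure only means bad quality;
 * debug builds only. */
#ifndef NDEBUG
#  define SOFT_ASSERT(cond) HARD_ASSERT(cond)
#else
#  define SOFT_ASSERT(cond) ((void)0)
#endif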
Flags: needinfo?(jmvalin)
(In reply to Jean-Marc Valin (:jmspeex) from comment #7)
> I'm currently looking at making a significant fraction of the Opus
> assertions unconditional. Basically, I'm dividing the assertions into two
> groups:
> 1) The ones that are cheap and that are "provably true" (and for which
> failure is scary)
> 2) The ones that are either too expensive or for which failure would just
> result in bad quality.
>
> Assertions in category 1) could be enabled with a special --enable-hardening
> option that has a very small cost and is highly unlikely to cause a crash
> just because some complicated math didn't behave like it should have.
>
> In addition, I'm still considering adding a few checks on the state upon
> entering the decoder. I was at some point considering a full CRC, but that
> would be too costly. OTOH, checking that a few fields are sane would be
> cheap and effective at catching many issues.

Is this done or still ongoing? Which bugs are tracking this work? Can we assign this one here to you then? ;)
Flags: needinfo?(jmvalin)
I landed about a dozen commits in the Opus master branch that enable most assertions when compiling with --enable-hardening (or defining ENABLE_HARDENING). There's also a bunch of extra sanity checks on the decoder state which should catch most/all state corruption errors.

I also have another patch, not yet landed in master, that attempts to check that the pointers passed to the decoder do not alias each other and cause corruption during the decoding process (e.g. if the pcm buffer overlaps the state, then the decoder could overwrite its own state by writing the output audio). That patch is at: https://jmvalin.ca/misc_stuff/aliasing_check3.patch

I wouldn't pull this in for a release, but I think it's safe enough for nightly/beta.
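The gist of such an aliasing check, as a sketch (the helper name and the hypothetical usage below are assumptions, not the contents of the linked patch):

#include <stddef.h>
#include <stdint.h>

/* Returns non-zero if the two byte ranges intersect. */
static int ranges_overlap(const void *a, size_t a_len, const void *b, size_t b_len)
{
   uintptr_t a0 = (uintptr_t)a, a1 = a0 + a_len;
   uintptr_t b0 = (uintptr_t)b, b1 = b0 + b_len;
   return a0 < b1 && b0 < a1;   /* half-open interval intersection */
}

/* Hypothetical use at the top of the decode call: refuse to decode if the
 * caller's pcm output buffer overlaps the decoder state, since decoding
 * would then overwrite the state with output audio.
 *
 *   if (ranges_overlap(st, sizeof(*st), pcm,
 *                      (size_t)frame_size * channels * sizeof(*pcm)))
 *      return OPUS_BAD_ARG;
 */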
Flags: needinfo?(jmvalin)
Yes, this sounds super interesting. I think this could be very useful for our fuzzers. CCing decoder.
I have a copy of Jean-Marc's patches that apply to inbound (without importing a new version of Opus). They pass Try. https://treeherder.mozilla.org/#/jobs?repo=try&revision=23bd98a4b4e93c437b06390001514480e232f74d (oranges there appear to be unrelated to this patchset)
(In reply to Randell Jesup [:jesup] from comment #11)
> I have a copy of Jean-Marc's patches that apply to inbound (without
> importing a new version of Opus). They pass Try.
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=23bd98a4b4e93c437b06390001514480e232f74d
> (oranges there appear to be unrelated to this patchset)

The patch appears to create quite a few oranges. It is also not clear to me how we would import a patch into nightly and beta only. Or is the patch only intended to help with fuzzing?
Assignee: nobody → jmvalin
Rank: 5
Flags: needinfo?(jmvalin)
Priority: -- → P1
I'm not sure what info you'd like from me.
Flags: needinfo?(jmvalin)
Nils, could you clarify your question for Jean-Marc?
Flags: needinfo?(drno)
Tyson, are you by any chance already fuzzing Opus, and could you integrate Jean-Marc's patch from comment #9 to see if that results in any reproducible issues?
Flags: needinfo?(drno) → needinfo?(twsmith)
Jean-Marc, how far out is an Opus 1.3 release? Or do you recommend updating to the latest HEAD now instead?
Flags: needinfo?(jmvalin)
I'm still doing stabilization towards a final 1.3 release. That being said, there's nothing in the current master that's a known regression compared to 1.2.1. So I'd say it's safe for beta, but it might be worth doing a bit more testing before putting it in a full release.
Flags: needinfo?(jmvalin)
(In reply to Nils Ohlmeier [:drno] from comment #15)
> Tyson, are you by any chance already fuzzing Opus, and could you integrate
> Jean-Marc's patch from comment #9 to see if that results in any reproducible
> issues?

I'm not actively fuzzing it since it is in OSS-Fuzz. But I'll have a look. I'll try to have something running over the weekend.
Flags: needinfo?(twsmith)
(In reply to Tyson Smith [:tsmith] from comment #18)
> I'll try to have something running over the weekend.

I fuzzed opus with the attached patch applied and it did not turn up any issues.
Thanks Tyson. Then I would suggest we upgrade libopus to current master and check if the new asserts give us more information. I suggest we then use ENABLE_HARDENING only in Nightly and Beta and see what that gets us.
Depends on: 1471148
Is there more followup or investigation we can do here?
Flags: needinfo?(drno)
Flags: needinfo?(drno)
Dan, can I ask you to do an update of libopus from upstream in a separate non-sec bug and enable the hardening as described in comment #20?
Flags: needinfo?(dminor)
Flags: needinfo?(dminor)
Did hardening find anything BTW?
(In reply to Jean-Marc Valin (:jmspeex) from comment #24)
> Did hardening find anything BTW?

We need to ship it first; none of these have known steps-to-reproduce to let us test.
There are still crashes showing up with these signatures in 64 beta, though we shipped an updated version of libopus in 64 in bug 1487049.
Are these assert failures now? Any link to 64 crashes?
I'm seeing crazy things like an invalid write on the line "if(ftb>EC_UINT_BITS){" and an invalid read on "total_bits = len*8;" where one is a local variable and the other is an argument to the function. An EXEC violation calling a static C function in the same file as the caller. Maybe it's memory damage, but in at least one case I saw the same install crash with the same nonsense stack half a dozen times in a row. Not sure what to call this, but it doesn't look like a "sec-critical" we can make progress on.

Ted: is there anything we can do here short of something like bug 639684 or bug 634726, and is there any plan to actually do something along those lines?
Flags: needinfo?(ted)
Keywords: sec-critical
It would really help if I could have access to some stack data, things like the values of the arguments passed through the stack trace. Otherwise, I'm going in blind and there's indeed little chance I'd find anything. My bet is still on "outside" code corrupting something in Opus, but it'd be good to confirm either way.
(In reply to Jean-Marc Valin (:jmspeex) from comment #30)
> It would really help if I could have access to some stack data, things like
> the values of the arguments passed through the stack trace. Otherwise, I'm
> going in blind and there's indeed little chance I'd find anything. My bet is
> still on "outside" code corrupting something in Opus, but it'd be good to
> confirm either way.

You should request raw dump access for Socorro; that will let you download minidump files, which will give you access to stack memory. For Windows dumps you'll be able to open them in a debugger.
Flags: needinfo?(ted)
(In reply to Daniel Veditz [:dveditz] from comment #29)
> Ted: is there anything we can do here short of something like bug 639684 or
> bug 634726, and is there any plan to actually do something along those lines?

I don't know of anyone actively working on those things, but heycam does have a proof-of-concept tool to check whether instructions in the minidump match the binaries they came from; bug 1274628 has some info on that. If these crashes are due to instructions being corrupted, that could help.
Keywords: stalled

Removing employee no longer with company from CC list of private bugs.

I'm no longer with Mozilla. OTOH, I'm still the Opus maintainer.

The severity field for this bug is set to normal. However, the bug is flagged with the sec-high keyword.
:jmvalin, could you consider increasing the severity of this security bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmvalin)

Is anyone actively pursuing this?

Flags: needinfo?(jmvalin)
Depends on: 1762614
Blocks: 1762642

I've filed bugs to update opus and make it auto-update; but I can't speak to the investigation of the crashes.

Crash Signature: [@ quant_all_bands] [@ celt_decode_with_ec] [@ clt_mdct_backward_c] [@ comb_filter] [@ kf_bfly5] [@ compute_allocation] [@ opus_decode_frame] → [@ quant_all_bands] [@ celt_decode_with_ec] [@ clt_mdct_backward_c] [@ comb_filter] [@ kf_bfly5] [@ compute_allocation] [@ opus_decode_frame] [@ opus_encode_native ] [@ opus_fft_impl ] [@ abort | celt_fatal ]

The bug assignee is inactive on Bugzilla, and this bug has priority 'P1'.
:jimm, could you have a look please?

For more information, please visit auto_nag documentation.

Assignee: jmvalin → nobody
Flags: needinfo?(jmathies)
Severity: normal → S3

(In reply to Tom Ritter [:tjr] from comment #37)

> I've filed bugs to update opus and make it auto-update; but I can't speak to the investigation of the crashes.

Hey Tom, curious, how can I check for recent updates via updatebot? Is there some sort of log or dashboard someplace?

We have updatebot configured for libopus -

https://searchfox.org/mozilla-central/source/media/libopus/moz.yaml

If I search Bugzilla for update-related bugs, though, I can't find any recent pulls / posts / attempts since you updated it back in March -

https://bugzilla.mozilla.org/buglist.cgi?query_format=advanced&list_id=16248901&short_desc_type=allwordssubstr&short_desc=Update%20libopus&classification=Client%20Software&classification=Developer%20Infrastructure&classification=Components&classification=Server%20Software&classification=Other&resolution=FIXED&resolution=INVALID&resolution=WONTFIX&resolution=INACTIVE&resolution=DUPLICATE&resolution=WORKSFORME&resolution=INCOMPLETE&resolution=SUPPORT&resolution=EXPIRED&resolution=MOVED

Seems like updatebot should have pulled recent changes in? (Looking at the repo, there have been updates since the last time we pulled the lib in.)

As for this bug, I think I'll break the various signatures that still show up into separate sec bugs for tracking.

Reviewing stalled bugs: there are significantly fewer crashes on record, so I'm going to close this as incomplete; if a pattern emerges, we can open a new bug to dig into it.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → INCOMPLETE

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled
Group: media-core-security