Opus crash due to race [@opus_multistream_decode_native]

VERIFIED FIXED in Firefox 18

Status

()

defect
--
critical
VERIFIED FIXED
7 years ago
6 years ago

People

(Reporter: posidron, Assigned: rillian)

Tracking

(4 keywords)

Trunk
mozilla20
x86_64
All
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite -

Firefox Tracking Flags

(firefox17 wontfix, firefox18+ fixed, firefox19+ verified, firefox20+ verified, firefox-esr10 unaffected, firefox-esr1718+ verified, b2g18 fixed)

Details

(Whiteboard: [adv-main18+][adv-esr17+])

Attachments

(6 attachments, 3 obsolete attachments)

Reporter

Description

7 years ago
Posted audio testcase (obsolete) —
I am guessing that this crash is triggered by a race condition inside the media decoder state machine. To reproduce reload the testcase with cmd+r or f5 a few times. The crashing location will vary on each execution.

Tested with changeset: 114362:c63d5cff18ba
Reporter

Comment 1

7 years ago
Posted file callstack
Reporter

Comment 3

7 years ago
Marked as s-s because I have seen ASan reporting heap-buffer-overflow signatures as well.
Posted audio testcase
I ran the testcase through rogg_crcfix so it fails without needing to patch out the libogg crc verification.
Attachment #687099 - Attachment is obsolete: true

Comment 5

7 years ago
Can we get an assignee here?
I tried to reproduce this last week and failed, so assigning to rillian, who says he has managed to reproduce.
Assignee: nobody → giles
Posted file gc callstack
I can reproduce this easily with a normal build, no asan required, but I've not made much progress tracking this down. Here's another backtrace, from a segfault inside the garbage collector, in case that's helpful.

Any ideas how to pursue this?
Although this could be a more recent regression, for now I'm going to assume it's been there in Opus all along until we know better.
That will speed up my bisection. Thanks!
Oh, wait. Did you actually verify the problem, or just guess?
I did a besection, just testing a normal build with the crc-fixed testcase, and hitting reload 10-20 times to see if I could get it to segfault. The first bad commit was:

commit d682b69bcedcbbf041e9a2ba67bed073736217da (git)
Author: Alexandros Chronopoulos <achronop@gmail.com>
Date:   Thu Aug 9 15:53:23 2012 -0700

    Bug 748144 - Support multichannel Opus files. r=rillian

Which looks more helpful than I was expecting.
Another variation of the crash:

-919025920[7fff9c009dc0]: 2e300d0 Decoder playing 960 frames of data to stream for AudioData at 396000
Assertion failure: addr % CellSize == 0, at /home/giles/mozilla/firefox/js/src/gc/Heap.h:819
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff4922d45 in js::gc::Cell::address (this=0x7fff7800ada2)
    at /home/giles/mozilla/firefox/js/src/gc/Heap.h:819
819	    JS_ASSERT(addr % CellSize == 0);
Missing separate debuginfos, use: debuginfo-install alsa-plugins-pulseaudio-1.0.26-1.fc17.x86_64 gsm-1.0.13-6.fc17.x86_64 json-c-0.9-4.fc17.x86_64
(gdb) bt
#0  0x00007ffff4922d45 in js::gc::Cell::address (this=0x7fff7800ada2)
    at /home/giles/mozilla/firefox/js/src/gc/Heap.h:819
#1  0x00007ffff4922daa in js::gc::Cell::arenaHeader (this=0x7fff7800ada2)
    at /home/giles/mozilla/firefox/js/src/gc/Heap.h:935
#2  0x00007ffff4924c1c in js::gc::Cell::compartment (this=0x7fff7800ada2)
    at /home/giles/mozilla/firefox/js/src/gc/Heap.h:980
#3  0x00007ffff4c6a8dd in CheckMarkedThing<JSString> (trc=0x7fffffff5050, 
    thing=0x7fff7800ada2)
    at /home/giles/mozilla/firefox/js/src/gc/Marking.cpp:106
#4  0x00007ffff4c68da8 in MarkInternal<JSString> (trc=0x7fffffff5050, thingp=
    0x7fffcaa17268) at /home/giles/mozilla/firefox/js/src/gc/Marking.cpp:145
#5  0x00007ffff4c68365 in MarkUnbarriered<JSString> (trc=0x7fffffff5050, 
    thingp=0x7fffcaa17268, name=0x7ffff56e0ce7 "left child")
    at /home/giles/mozilla/firefox/js/src/gc/Marking.cpp:175
#6  0x00007ffff4c62b19 in js::gc::MarkStringUnbarriered (trc=0x7fffffff5050, 
    thingp=0x7fffcaa17268, name=0x7ffff56e0ce7 "left child")
    at /home/giles/mozilla/firefox/js/src/gc/Marking.cpp:317
...
Stack pointer seems to be getting corrupted somehow. When it crashes in the opus code, it's pretty commonly here:

614	   ptr = (char*)st + align(sizeof(OpusMSDecoder));
   0x00007ffff43c68cd <+111>:	mov    $0x10c,%edi
=> 0x00007ffff43c68d2 <+116>:	callq  0x7ffff43c53e8 <align>

and in some instances rsp points to unmapped memory, so the callq segfaults trying to push rip.
(In reply to Ralph Giles (:rillian) from comment #13)
> Stack pointer seems to be getting corrupted somehow. When it crashes in the
> opus code, it's pretty commonly here:

Okay, that was enough to start making some guesses. media/libopus/src/opus_multistream.c:613 reads
  ALLOC(buf, 2*frame_size, opus_val16)

frame_size is documented to have a maximum of 63*2880 in nsOggReader::DecodeOpus() (and this documentation is correct). 2*63*2880*sizeof(opus_val16) (which in this case is float, since we're not using a fixed-point build) is 1,451,520 bytes, or 1.38 MB. Our media thread stacks are 128 kB. So that's a problem.

This is obviously ridiculous: 63*2880 is 3.78 seconds of audio, and the maximum valid Opus packet size is 120 ms (giving a buffer size of 2*5760*4==45 kB here). It will get rejected by libopus later, but the damage to the stack has already been done.

The simple fix is to reject the packet earlier. The best place is before we allocate the buffer for the frame.
Attachment #690665 - Flags: review?(giles)
Comment on attachment 690665 [details] [diff] [review]
Don't try to decode Opus packets larger than 120 ms

Review of attachment 690665 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for your help with this, Tim. I'd seen the allocation was large, but didn't think about the thread's stack size being the problem.

Looks good to me, and I can't make it crash with this patch. Some comments below.

::: content/media/ogg/OggReader.cpp
@@ +424,5 @@
>    int32_t samples = opus_packet_get_samples_per_frame(aPacket->packet,
>                                                        (opus_int32) mOpusState->mRate);
>    int32_t frames = frames_number*samples;
>  
> +  // The maximum valid value is 5760 (120 ms).

I would be be more verbose here, e.g.

// Valid Opus packets must be no more than 120 ms,
// so reject any reporting more than 5760 samples.

@@ +425,5 @@
>                                                        (opus_int32) mOpusState->mRate);
>    int32_t frames = frames_number*samples;
>  
> +  // The maximum valid value is 5760 (120 ms).
> +  if (frames <= 0 || frames > 5760)

The <= zero check is an important guard for the subsequent allocation, and detects error returns from opus_packet_get_nb_frames, but I suppose we could tighten this to the minimum as well?

The minimum opus frame size is 2.5 ms, or 120 samples, so we could check for (frames < 120) instead.

Currently opus_packet_get_samples_per_frame() will never return a negative, so checking for frames < 0 is sufficient to detect a failure. But, get_nb_frames only returns an error if the buffer is short, in which case the subsequent get_samples_per_frame call could segfault reading the toc byte for the packetS o really we should check right immediately afterward.
Attachment #690665 - Flags: review?(giles) → review+
Christoph, can you please confirm this fixes your original testcase?
I believe I understand the crash in the opus code now, but I don't understand how this was causing the similar segfaults in js::gc. The jit does fancy things with the stack frames, but I would have thought the context switch in and out of the decode thread would just save and restore the invalid stack pointer without affecting other code.

Is it just that we managed to write into someone else's stack frame with our bogus pointer?
Validated test case file on Fedora Linux 17 x86_64 with official builds:

firefox-19.0a2: crashed
firefox-18.0b3: crashed
firefox-17.0.1: crashed
firefox-16.0.2: can't reproduce

Which is as expected. The call into the code with the stack overflow was added in ff17.

dveditz, why is firefox-17 wontfix? If we're indeed writing to the ionmonkey stack, couldn't this be exploitable?
OS: Mac OS X → All
Reporter

Comment 20

7 years ago
Have tested again with the applied patch but still crashes.
(In reply to Christoph Diehl [:cdiehl] from comment #20)
> Have tested again with the applied patch but still crashes.

Well, darn. We should still land Tim's fix though; that's a real bug.
Reporter

Comment 22

7 years ago
I am terrible sorry, I thought doing 'make obj/content/media' was enough.

The patch does fix the problem.
whew! thanks for retesting.
Tim, here's a patch implementing my comments from #16. Let me know if you like it better.
Attachment #691018 - Flags: review?(tterribe)
Comment on attachment 691018 [details] [diff] [review]
Reject packets with invalid lengths

Review of attachment 691018 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Ralph Giles (:rillian) from comment #18)
> Is it just that we managed to write into someone else's stack frame with our
> bogus pointer?

That seems the most likely to me.

(In reply to Ralph Giles (:rillian) from comment #19)
> dveditz, why is firefox-17 wontfix? If we're indeed writing to the ionmonkey
> stack, couldn't this be exploitable?

I think he just means that this should land on the ESR branch (once we request and get approval).

Anyway, patch LGTM.

::: content/media/ogg/OggReader.cpp
@@ +427,5 @@
>                                                        (opus_int32) mOpusState->mRate);
>    int32_t frames = frames_number*samples;
>  
> +  // Valid Opus packets must be no shorter than 2.5 ms
> +  // and no longer than 120 ms. Reject any reporting

It seems kind of odd to call out one case when we reject in two different cases.
Attachment #691018 - Flags: review?(tterribe) → review+
(In reply to Ralph Giles (:rillian) from comment #18)
> Currently opus_packet_get_samples_per_frame() will never return a negative,
> so checking for frames < 0 is sufficient to detect a failure. But,
> get_nb_frames only returns an error if the buffer is short, in which case
> the subsequent get_samples_per_frame call could segfault reading the toc
> byte for the packetS o really we should check right immediately afterward.

Good catch, BTW. I've known this in the half-dozen times I implemented this function (including GetOpusDeltaGP() in OggCodecState.cpp!), but managed to miss that here.
Comment on attachment 690665 [details] [diff] [review]
Don't try to decode Opus packets larger than 120 ms

Marking obsolete in favour of the newer patch.
Attachment #690665 - Attachment is obsolete: true
(In reply to Timothy B. Terriberry (:derf) from comment #25)

> > +  // Valid Opus packets must be no shorter than 2.5 ms
> > +  // and no longer than 120 ms. Reject any reporting
> 
> It seems kind of odd to call out one case when we reject in two different
> cases.

I don't understand what you mean here. I tried to describe both bounds checks.
(In reply to Ralph Giles (:rillian) from comment #28)

> I don't understand what you mean here. I tried to describe both bounds
> checks.

Oh, you meant the next line. Sorry. How about:

// Valid Opus packets must be no shorter than 2.5 ms
// and no longer than 120 ms. Reject any reporting
// a duration outside that range in samples.

Or is it clear enough without the second sentence?
(In reply to Ralph Giles (:rillian) from comment #29)
> Or is it clear enough without the second sentence?

Either way is fine.
Updated patch with improved comments. Carrying forward previous r+.
Attachment #691018 - Attachment is obsolete: true
Attachment #691399 - Flags: review+
Before this lands, be sure to ask for sec-approval on the patch and, ideally, I think we'd want branch patches for affected versions depending on the risk involved in taking it.
Thanks for the reminder.
Keywords: checkin-needed
Comment on attachment 691399 [details] [diff] [review]
Reject packets with invalid lengths

[Security approval request comment]
> How easily can the security issue be deduced from the patch?

The patch doesn't refer to the stack overflow, it just describes rejecting packets with invalid header data. There's also a reference to a potential--but non-exploitable--heap segfault fix which could mislead an attacker.

> Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

We are limiting the return values we use from a fairly simple function, so someone digging deeper could see what part of the data to play with mutating. I don't think that qualifies as a bulls-eye though; the stack overflow happens away from the site of this change.

> Which older supported branches are affected by this flaw?

Supported branches back to Firefox 17 are affected, when this code initially landed.

> If not all supported branches, which bug introduced the flaw?

Bug 748144.

> Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

I can post backports. They're straightforward; the code itself hasn't changed, but the enclosing class and file names have, so the patch must be adapted to that.

> How likely is this patch to cause regressions; how much testing does it need?

Risk of causing other problems is low. The patch simply rejects invalid data earlier in the parsing sequence.

To verify, check that the testcase from this bug does not crash the browser, and that the content/media/test/test_playback.html mochitest still passes.
Attachment #691399 - Flags: sec-approval?
Attachment #691399 - Flags: sec-approval? → sec-approval+
I'm giving this sec-approval+ for central. We should prepare branch patches and nominate them. I'm cc'ing release management so they can decide when they want to take this on what branches.
Comment on attachment 691399 [details] [diff] [review]
Reject packets with invalid lengths

Patch applies unmodified to the Aurora tree.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 748144.

User impact if declined:

Users will be exposed to a security issue: crashes possibly exploitable by malicious web content.

Testing completed (on m-c, etc.):

Passes try push on m-c. Verified patch fixes the crash on aurora.

Risk to taking this patch (and alternatives if risky):

Risk is low. The patch merely rejects invalid data earlier in the parsing sequence.

String or UUID changes made by this patch:

No changes, only interior code is affected.
Attachment #691399 - Flags: approval-mozilla-aurora?
Backported to Firefox 18 (current beta tree). This is the same code change, just updated for enclosing class and filename changes.

[Approval Request Comment]
> Bug caused by (feature/regressing bug #):

Bug 748144.

> User impact if declined: 

Users will be exposed to a security issue: crashes possibly exploitable by malicious web content.

> Testing completed (on m-c, etc.):

Passes try push on m-c. Verified patch fixes the crash on beta.

> Risk to taking this patch (and alternatives if risky):

Risk is low. The patch merely rejects invalid data earlier in the parsing sequence.

> String or UUID changes made by this patch: 

No changes, only interior code is affected.
Attachment #692455 - Flags: approval-mozilla-beta?
Approving patch for aurora/beta as it is low risk.Juan, can you please help with some testing here around opus/ogg to verify this patch ?
QA Contact: jbecerra
Attachment #691399 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #692455 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Thanks bajaj, Juan.
Comment on attachment 692455 [details] [diff] [review]
backport to earlier releases

This bug also affects Firefox 17. The ff18 patch applies to the esr17 branch and I've verified it resolves the issue and that the content/media mochitests are unaffected in my local testing.

[Approval Request Comment]
> If this is not a sec:{high,crit} bug, please state case for ESR consideration:

This is a sec-critical bug.

> User impact if declined:

Users will be exposed to crashes possibly exploitable by malicious web content.

> Fix Landed on Version:

We have approval for m-c, aurora, and beta. Waiting for coordination with release management before landing anywhere.

> Risk to taking this patch (and alternatives if risky):

Risk is low. The patch merely rejects invalid data earlier in the parsing sequence.

> String or UUID changes made by this patch:

No changes, only interior code is affected.

See https://wiki.mozilla.org/Release_Management/ESR_Landing_Process for more info.
Attachment #692455 - Flags: approval-mozilla-esr17?
Comment on attachment 692455 [details] [diff] [review]
backport to earlier releases

Feel free to land to all branches given your testing. QA will continue to test opus/ogg in parallel to beta 5, but we don't need to block the landing on this testing.
Attachment #692455 - Flags: approval-mozilla-esr17? → approval-mozilla-esr17+
I've spoken with Alex and Ralph about this bug and the feeling is that the risk to uplifting this to ESR is low. My suggestion is that QA coordinates exploratory testing around Opus in Beta5. If we find any issues, we can always backout of ESR and Beta at that time. However that remains a very low risk.
Taking QA Contact so I can coordinate some Opus testing.
QA Contact: jbecerra → anthony.s.hughes
Reporter

Comment 45

7 years ago
Btw.
It runs and will run also as a fuzzing job on my private and our future cloud servers.
Thanks everyone.
Keywords: checkin-needed
https://hg.mozilla.org/integration/mozilla-inbound/rev/746f31757355

Once this hits m-c, it can be uplifted to the release branches.
Flags: in-testsuite?
Keywords: checkin-needed
Since the crash isn't reliable I don't think I can make a useful testsuite entry for it. The steps to reproduce in the description are the best way to verify the bug.
Flags: in-testsuite? → in-testsuite-
https://hg.mozilla.org/mozilla-central/rev/746f31757355
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Looks like this got approval for Beta but didn't land in time. Can we make sure it gets landed for Beta 6?
I think we were waiting for inbound to cycle into nightly. What's the cut-off date for Beta 6?
After consultation with bajaj, the plan is to review this for beta6 in tomorrow's triage, and then land or not based on the decision there.

I'll go ahead and push to aurora.
Dropping qawanted since there's nothing for us to investigate. Will continue to track for verification.
Keywords: qawanted
I...have no idea how I messed up the flags on this bug. I thought I just set status-firefox-esr17 to fixed.

I've put back what I can, but I don't seem to have an option to set the tracking flags for ff18 and ff19 to '+'.
(In reply to Christoph Diehl [:cdiehl] from comment #0)
> Created attachment 687099 [details]
> testcase

Testcase failed to crash the following builds:
 * Firefox Nightly 20.0a1 2012-12-21
 * Firefox Aurora 19.0a2 2012-12-21
 * Firefox 17.0.1esrpre 2012-12-21

Will need to wait for 18.0b6 builds next week to verify for Fx18.
Whiteboard: [adv-main18+][adv-esr17+]
Group: core-security
You need to log in before you can comment on or make changes to this bug.