WebRTC crash [@VDCVideoNewPlugIn]

RESOLVED FIXED in Firefox 20

Status

()

P1
critical
RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: posidron, Assigned: smichaud)

Tracking

(Blocks: 1 bug, {crash, sec-high, testcase})

Trunk
mozilla22
x86_64
Mac OS X
crash, sec-high, testcase
Points:
---
Bug Flags:
in-testsuite -

Firefox Tracking Flags

(firefox20+ fixed, firefox21+ fixed, firefox22 fixed, firefox-esr17 unaffected, b2g18 unaffected)

Details

(Whiteboard: [getUserMedia][blocking-gum+][qa-][adv-main20-], crash signature)

Attachments

(9 attachments, 3 obsolete attachments)

301 bytes, text/html
Details
5.31 KB, text/plain
Details
1.82 KB, text/plain
Details
23.44 KB, text/plain
Details
21.44 KB, text/plain
Details
2.89 KB, patch
Details | Diff | Splinter Review
844 bytes, application/x-gzip
Details
2.26 KB, application/x-gzip
Details
12.52 KB, patch
smichaud
: review+
Details | Diff | Splinter Review
(Reporter)

Description

6 years ago
Created attachment 709525 [details]
testcase

This seems to be a crash in the device driver of MacOS.

newton % sw_vers
ProductName:	Mac OS X
ProductVersion:	10.8.2
BuildVersion:	12C3006

Before the crash happens we can see in the console:

Sun Feb  3 22:32:42 2013 CMIO_Unit_Output_Base.cpp:197:RenderBus something has gone wrong!
Sun Feb  3 22:32:42 2013 CMIO_Unit_Output_Base.cpp:199:RenderBus 	idx = 0, mNumInputs = 1, mNumInputsAllocatedFor = 0
Sun Feb  3 22:32:42 2013 CMIO_Unit_Output_Base.cpp:201:RenderBus 	theInput = 0x1211dec80, mPullInputOnNextRound = 0x0, mInputHasSeenEndOfData = 0x0
[...]

The testcase will not crash in a debug/non-opt build.

Tested with m-i changeset: 120689:eae4b34eb792 and -O2
Tested with m-c changeset: 120354:2cc710018b14 and -O2
(Reporter)

Comment 1

6 years ago
Created attachment 709526 [details]
callstack
Very fast start/stop of video here....  Perhaps we're triggering an OS/Mac Foundation bug or bug in the webrtc.org code by stopping capture before anything is captured, or some such.

It's possible 3.20 will fix it.

A workaround may be to enforce a minimum 'on' period (or number of frames captured) before stop() takes effect, especially as it's known that debug builds don't crash.
Assignee: nobody → rjesup
Priority: -- → P1
Whiteboard: [getUserMedia][blocking-gum+][webrtc-uplift]
Flags: in-testsuite?
We're trying to ship this as part of FF 20. So noming for tracking.
tracking-firefox20: --- → ?
tracking-firefox21: --- → ?

Updated

6 years ago
status-firefox20: --- → affected
status-firefox21: --- → affected
tracking-firefox20: ? → +
tracking-firefox21: ? → +
(In reply to Randell Jesup [:jesup] from comment #2)
> It's possible 3.20 will fix it.

Christoph -- 3.20 just landed yesterday.  Can you retest and see if this is still happening?  Thanks.
Flags: needinfo?(cdiehl)
(Reporter)

Comment 5

6 years ago
Still works for me.
Flags: needinfo?(cdiehl)
(In reply to Christoph Diehl [:cdiehl] from comment #5)
> Still works for me.

Okay. Does it work for you on Aurora as well?
(Reporter)

Comment 7

6 years ago
With current Aurora we see the same error messages but no crash.
It does not look like a real ASan specific catch here, will test it later with an ASan build of Aurora though.
(Reporter)

Comment 8

6 years ago
I have tested this now with an ASan debug build of Aurora but it passes the testcase as well. The build I used is: https://people.mozilla.com/~choller/firefox/asan/20130214-mozilla-aurora-macosx64-debug-d083267a188a+asan.html
I currently have not enough disk space left to fetch the Aurora tree and make a optimized build out of it. All my previous builds are optimized and the testcases crashes in those builds successfully.
Let's close this as a works for me then. If we end up reproducing later with some other configuration, then let's reopen.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → WORKSFORME
My opt asan mac build just finished (inbound).  Second click on "crash" crashed with the same signature
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Jib is seeing 
Thu Feb 14 17:44:37 2013 CMIO_Graph.cpp:8839:HandleRenderNotify CMIOGraph::HandleRenderNotify() called but graph is not initialized!
http://pastebin.mozilla.org/2139860

where I'm crashing with the RenderBus: something has gone wrong! error.

He's running OSX 10.8.2, I'm on 10.7.  What are others running where the crash is seen or not seen in an opt asan build on Mac?

Could be:
a) OS bug fixed in 10.8 (or a missing OS trap of bad inputs added in 10.8)
b) compiler bug that causes bad inputs (see a)
c) code in mac drivers in this edge case (fast open/close) has a bug that simply passed bad args to the OS
???
(Reporter)

Comment 12

6 years ago
(In reply to Christoph Diehl [:cdiehl] from comment #0)
> newton % sw_vers
> ProductName:	Mac OS X
> ProductVersion: 10.8.2
> BuildVersion:	12C3006

It's not a)
cdiehl: I assume that means your crashes were under 10.8?

d) different Mac Camera drivers react differently

I have a MacBook Pro laptop, non-retina, Core i7.  (how do I get driver/HW info from it?)  I know jib has a newer mac
(Reporter)

Comment 14

6 years ago
(In reply to Randell Jesup [:jesup] from comment #13)
> cdiehl: I assume that means your crashes were under 10.8?

Correct

> I have a MacBook Pro laptop, non-retina, Core i7.  (how do I get driver/HW
> info from it?)  I know jib has a newer mac

$ system_profiler | more
I got the same error as jib in bug 815231 when running the testcase (constant errors like that).  May be a timing issue too
        FaceTime HD Camera (Built-in):

          Product ID: 0x8509
          Vendor ID: 0x05ac  (Apple Inc.)
          Version: 5.16
          Serial Number: CC2B7C06LLDGFKL0
          Speed: Up to 480 Mb/sec
          Manufacturer: Apple Inc.
          Location ID: 0xfa200000 / 3
          Current Available (mA): 500
          Current Required (mA): 500

      System Version: Mac OS X 10.7.5 (11G56)
      Kernel Version: Darwin 11.4.2
Yes I have a Retina MacBook Pro laptop "Mid-2012" Core i7 and I'm NOT crashing.

            FaceTime HD Camera (Built-in):

              Product ID: 0x8510
              Vendor ID: 0x05ac  (Apple Inc.)
              Version: 80.25
              Serial Number: CC2C9Q0K4NDN9KE0
              Speed: Up to 480 Mb/sec
              Manufacturer: Apple Inc.
              Location ID: 0x1a110000 / 3
              Current Available (mA): 500
              Current Required (mA): 500

      System Version: OS X 10.8.2 (12C3006)
      Kernel Version: Darwin 12.2.1
I'm failing on every second click of "Crash".  Interestingly, that was the pattern I saw with bug 815231 - one would succeed, the second would fail with errors in the logs, and repeat for an hour.

I got a webrtc_trace:65535 log from it, and it closed the capture device as the last thing it did.  Interestingly, this is the same point as where we were having issues with shutdown hangs due to sending events tot he mainthread in the QTKit code.  I'm adding the "proxy-release-to-mainthread" patch to this build to see if it helps.
To the newly cc'd people (smichaud, joedrew, bjacob): this seems to be deep in Mac-land, and the QTKit docs suck at this level.  We could use some assistance in attacking this.  Google searches on these error messages isn't finding much.  

With the shutdown hang patch that remotes the shutdown to mainthread the problem still happens, and perhaps even more easily or with more error messages before the crash.  And oddly, others like jib can't reproduce it at all.  (I suspect strongly an ASAN build is not needed to hit this, though an opt build may be.)

Any help or pointers would be greatly appreciated.  The code talking to the mac stuff is deep in media/webrtc/trunk, in .mm files.

Thanks!
+Jeff, who has done a lot of very deep debugging in places like this before.

Before you do anything else, though, I would send this testcase to Apple through our developer contacts (Steven should know how to use our developer resources) in order to ensure they know they've got a bug in their driver.
Joe - thanks, that makes sense and matches what I told akeybl.  It seems totally related to how fast you open and close the driver.
FYI, I also tried waiting until isRunning changed after startRunning/stopRunning, with no effect
> Steven should know how to use our developer resources

I don't.  And I'm not sure we really have any "Apple developer resources", if that's what you're asking about.  I'll CC Josh Aas to see if he might be able to help.

As for this bug, please tell me *exactly* what I need to do to be able to reproduce it.  If I need to do a special build, suggest mozconfig options for it.  If additional steps are required (beyond just starting the build and running the testcase), please let us know.  Finally let us know if special hardware is required (sounds like you need a camera, as can be found on a MacBook Pro).

I'm good at reverse engineering.  So once I know how to reproduce this bug, I may be able to find out what's causing it, and possibly even find a workaround.  But reverse engineering is very time consuming, so I can't promise quick results.
Ok, here are the STR:

a) use an ASAN mac build (per the instructions in the wiki).
Full opt build, no debug symbols.  I'll upload my mozconfig
b) load into gdb (you don't need to set the ASAN breakpoint, as it crashes)
c) go to the testcase here
d) click Crash; click to share the camera.  Repeat if needed.  Usually crashes on first or second try for me.  Never crashes for jib with a slightly different camera (though he gets errors (different ones).

NOTE: ASAN may very well NOT be needed; you can try first with a mac opt build. I'll try spinning one myself.  Also, it may be ok to have debug symbols in an opt build, though maybe that changes the timing enough to miss the bug.
Created attachment 714406 [details]
.mozconfig for asan
Thanks for the info.

> a) use an ASAN mac build (per the instructions in the wiki).

I assume you mean https://developer.mozilla.org/en-US/docs/Building_Firefox_with_Address_Sanitizer

My life would be easier if I could build with the clang/llvm that come with recent versions of XCode (specifically with the XCode Commandline Tools).  I'll try that and let you know my results.
I suspect that works.  I'm spinning a plain Mac opt build now
A regular full-opt build also crashes, though with a null deref instead of a random address, and crashes in CMIOUnitDALInputEntry().  It gets the same "RenderBug something has gone wrong!" errors preceding the crash.  Crashed on second select of "Crash".

This should make testing easier

clang -v says 4.1 (421.11.65) based on LLVM 3.1svn
BTW, I applied the fixes from 
https://webrtc-codereview.appspot.com/1097014
to our tree, but got the same crash (this was with asan this morning).  It does fix the "can't use Skype for video after getUserMedia" in FF (and in Chrome).
Created attachment 714468 [details]
OS X 10.6.8 crash stack on 15" Early 2011 MBP

I can get today's mozilla-central nightly to crash on a "15-inch Early 2011" MacBook Pro running OS X 10.6.8 or 10.7.5 (though for some reason not 10.8.2).  I've got gdb stack traces for both crashes.

Note that as of the addition of support for Benoit Girard's profiler, mozilla-central nightlies no longer have their symbols stripped.  So, wonderfully, gdb stack traces of mozilla-central nightlies are meaningful, and have been for about a year (if I remember right).
Created attachment 714470 [details]
OS X 10.7.5 crash stack on 15" Early 2011 MBP
Breakpad IDs for my crashes on OS X 10.6.8 and 10.7.5:

bp-b22c1d21-a9d8-4357-a37e-ab16c2130215
bp-b3df7be0-0a36-48aa-944e-495262130215
(Following up comment #32)

Note that there's serious corruption in the Breakpad stacks.  The gdb traces are *much* higher quality.
Those (at least the 10.7 ones) match what I and cdiehl (on 10.8) have hit.  jib gets 
Thu Feb 14 17:44:37 2013 CMIO_Graph.cpp:8839:HandleRenderNotify CMIOGraph::HandleRenderNotify() called but graph is not initialized!
Thu Feb 14 17:44:37 2013 CMIO_Graph.cpp:8839:HandleRenderNotify CMIOGraph::HandleRenderNotify() called but graph is not initialized!
where I crashed.  (My older test (bug 815231) for open/close would also get that error - it's probably a dup of this, though it may show even more strongly it's a timing issue.)
I tried to reproduce this crash but on my MBP (13", early 2011) it always crashes with the signature as filed on bug 815231.
I think they're different manifestations of the same general problem, and the way to produce them is extremely similar.  Can you dump your camera type/driver/etc?
FaceTime HD Camera (Built-in):

  Product ID:	0x8509
  Vendor ID:	0x05ac  (Apple Inc.)
  Version:	 5.16
  Serial Number:	CC2B2P01YVDG6LL0
  Speed:	Up to 480 Mb/sec
  Manufacturer:	Apple Inc.
  Location ID:	0xfa200000 / 2
  Current Available (mA):	500
  Current Required (mA):	500

System Version:	Mac OS X 10.7.5 (11G63)
Kernel Version:	Darwin 11.4.2
(Reporter)

Comment 38

6 years ago
The testcase of bug 815231 does not crash with m-c changeset: 122315:cc37417e2c28 for me.
Steven - any luck?  I'm working on a possible wallpaper patch to land as a backup if we can't find a root cause or get action from Apple.
Sorry, I let this one slip through the cracks ... probably because I felt obliged to list it merely as "security related" in my list of bugs on the Mozilla Status Board, and forgot what it was about :-(

I'll see if I can at least find a regression range today.
This bug's testcase doesn't do anything before the firefox-2013-01-06-03-09-02-mozilla-central nightly -- presumably because that's the first mc nightly after the patch for bug 825594 landed.

But (at least in my tests) these crashes don't start with that build.  Rather they start with the firefox-2013-01-08-03-34-57-mozilla-central nightly.

That gives us the following regression range:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=605ae260b7c8&tochange=dccab70d8554

I can't begin to guess which patch in this range might have triggered the crashes.  Furthermore I can only easily reproduce them on OS X 10.6.8, and the tree is no longer buildable there by ordinary mortals since the switch to clang (which happened with the patch for bug 733905 on 2012-07-19).

So it'll be very difficult for me to find the trigger using hg bisect.

Anyone have any ideas?
Randell:  If your wallpaper patch is at all acceptable I'd go ahead with it.
Whoever can reproduce this bug, please test my regression range from comment #41.
If that's the regression range, then I see two possible bug candidates:

- bug 827007
- bug 826576
Steven - those bugs Jason mentions make it actually Stop() the stream before destroying it and deal with some timing holes in the MSG code with RemoveListener().  So bisecting to find that it's one of those patches doesn't tell us a lot, other than somehow stopping the capture before separately from destroying the capture object changes the timing or "something else" to trigger it.

I can crash it reliably on 10.7.5.  cdiehl can crash it on 10.8.2.  It *seems* to be HW dependent as well as (maybe) OS dependent.
I'll try to bisect, but I don't think it will tell me anything useful other than it's timing-related.

I don't have a wallpaper patch yet; going to work on it tonight more.  Basically it will be a delay in shutting down the camera on Mac (min time between startup and shutdown).  If that doesn't work, wait for a frame (and maybe max N seconds for safety).  I do not know this will work; I merely presume/hope it will.
Randell and cdiehl:  Please please please try testing with mozilla-central nightlies.  And if you can repro please look for regression range(s).

Note that if this is an Apple bug (and perhaps more than one), we may not be able to "fix" it -- we may only have the choice between different kinds of workaround.
One issue to consider (since some of our code is off the main thread) is thread safety -- some Apple APIs are documented not to be thread safe.
> The code talking to the mac stuff is deep in media/webrtc/trunk, in
> .mm files.

I've just started looking through the .mm files at this location.
Note that I'm completely unfamiliar with this code or with the QTKit.

As best I can tell from a quick look-through, the low-level mac stuff
relevant to this bug is all in the following file:

media/webrtc/trunk/webrtc/modules/video_capture/mac/qtkit/video_capture_qtkit_objc.mm

Is this correct?

(Since no rendering seems to occur in the testcase, I assume the
rendering code isn't relevant.)

For reference here's Apple's QTKit Framework Reference:
https://developer.apple.com/library/mac/#documentation/QuickTime/Reference/QTCocoaObjCKit/_index.html
(In reply to Steven Michaud from comment #49)
> As best I can tell from a quick look-through, the low-level mac stuff
> relevant to this bug is all in the following file:
> 
> media/webrtc/trunk/webrtc/modules/video_capture/mac/qtkit/
> video_capture_qtkit_objc.mm
> 
> Is this correct?

That's correct. Note also that the locking around the buffer access is broken, but when I asked jesup to test my patch that fixes that <http://pastebin.mozilla.org/2141915>, it did not solve the problem:

12:55:44 < jesup> derf: for fun I tried your lock patch on bug 837539.  No joy
Yes, that's the location of the code.

I've seen the docs; they don't really help.  There's some level of thread safety here, as we had to proxy a shutdown release to mainthread because mainthread had stopped processing native events in shutdown.  They apparently use messages to the main native event loop to implement thread safety.  I believe I tried doing that always, and it didn't help (though it changed the number of duplicate errors I got before it crashed).
I did a search on 'QTKit "thread safe"' and found the following Apple doc:
http://developer.apple.com/library/mac/#technotes/tn2125/_index.html

It has a section on "QTKit Capture":
http://developer.apple.com/library/mac/#technotes/tn2125/_index.html#//apple_ref/doc/uid/DTS10003392-CH1-SUBSECTION17

This makes it sound like "QTKit Capture" is generally thread safe, though there are some precautions to take.
> They apparently use messages to the main native event loop to
> implement thread safety.

I don't understand this :-)

Is it implemented in current code?  And if so where?
>    QTCaptureSession*                    _captureSession;
>    QTCaptureDeviceInput*                _captureVideoDeviceInput;
>    QTCaptureDecompressedVideoOutput*    _captureDecompressedVideoOutput;

These three objects, from video_capture_qtkit_objc.h, appear to be the only QTKit objects used by our low-level code.  In principle we should be serializing access to them with locks, but aren't currently doing so.

I'm not sure this matters (the same purpose may already be accomplished in higher-level code).  But I thought I'd mention it.

That's all for today.  I'm off to shovel some snow!
> In principle we should be serializing access to them with locks, but
> aren't currently doing so.

Oops, I missed _rLock.

And otherwise none of the unlocked accesses to these objects seem to
need protection.
>> They apparently use messages to the main native event loop to
>> implement thread safety.
>
> I don't understand this :-)

Now I understand it, I think.  Here's an example from our source code:

    [_captureDecompressedVideoOutput
      performSelectorOnMainThread:@selector(setPixelBufferAttributes:)
                       withObject:captureDictionary waitUntilDone:NO];
//    [_captureDecompressedVideoOutput setPixelBufferAttributes:captureDictionary];

What the substitute (not commented out) code does is to arrange for
this method to be called on the main thread -- which (of course)
guarantees that different calls to the method won't happen
simultaneously.
Tomorrow I'll do a test build with a bunch of debug logging (NSLog statements) added to video_capture_qtkit_objc.mm.  I'll also knock out selected bits and pieces of code, to see what difference that makes.

In other words, I'll semi-randomly mess around with video_capture_qtkit_objc.mm, and see where it leads me.

Do please keep looking for those regression ranges, though.  This will *not* be an easy bug to fix or work around.
Created attachment 719185 [details] [diff] [review]
Wierd workaround, likely *not* a fix

As promised I've been messing around with video_capture_qtkit_objc.mm.  Along the way I discovered that this weird patch "fixes" this bug's crashes for me, on OS X 10.6.8.

Whoever can reproduce these crashes, please let me know if it also "fixes" your crashes.

It's very unlikely that it will be an acceptable workaround.  But I want to find out if I've hit an important nerve, and if it's worthwhile trying to figure out *why* this patch "works".  Which may lead me to a more acceptable workaround.

I've started a tryserver mozilla-central build, for those who want one.  It should be available in a few hours.
Besides fixing the crash on my 15" early 2011 MBP running OS X 10.6.8, I find that it gets rid of the following error message when tested on a Retina MBP running OS X 10.7.5 (where I haven't seen any crashes):

CMIO_Graph.cpp:8709:HandleRenderNotify CMIOGraph::HandleRenderNotify() called but graph is not initialized!

So I suppose a QTCaptureVideoPreviewOutput object gets automatically initialized in a way that a QTCaptureDecompressedVideoObject doesn't, and that I need to figure out how to explicitly perform that initialization on the latter object.

That's what I'll be trying to do tomorrow.
Note that I also see a bunch of "autoreleased with no pool in place" running this bug's testcase (even on hardware and OS versions where I don't crash).  These messages are harmless and unrelated.  They're also easily fixed.

I'll do that when/if I fix or work around this bug.
Still crashes on 10.7.5 :-(

Different errors though:

Wed Feb 27 20:33:13 2013 CMIO_Unit_Output_Base.cpp:197:RenderBus something has gone wrong!
Wed Feb 27 20:33:13 2013 CMIO_Unit_Output_Base.cpp:199:RenderBus 	idx = 0, mNumInputs = 1, mNumInputsAllocatedFor = 0
Wed Feb 27 20:33:13 2013 CMIO_Unit_Output_Base.cpp:201:RenderBus 	theInput = 0x11d568ef0, mPullInputOnNextRound = 0x0, mInputHasSeenEndOfData = 0x0
2013-02-27 20:33:14.193 firefox[27453:14203] *** QTCaptureVideoPreviewOutput warning: Output received the following error while decompressing video: Error Domain=NSOSStatusErrorDomain Code=-67446 "The operation couldn’t be completed. (OSStatus error -67446.)". Make sure that the pixel buffer attributes of all outputs are valid.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x0000007c00000001
[Switching to process 27453 thread 0x14203]
0x0000007c00000001 in ?? ()
Thanks for the info.  And too bad my "fix" didn't work for you.

But your new errors could still be the result of lack of initialization, so I'll still keep trying to pursue my hunch.

By the way, I think I've ruled out the possibility that the crashes and errors are due to accessing deleted objects.  Deliberately leaking all our native objects didn't make the crashes go away.  Neither did testing with CFZombieLevel=17 (which among other things stops the "zombies" from ever being freed).

To use any kind of allocation debugging on OS X, you generally have to turn off jemalloc.  Do this by setting a NO_MAC_JEMALLOC environment variable to some value.

Even with jemalloc off, though, I still haven't got libgmalloc working in current trunk code.  Which is a shame, since this used to be one of my favorite tools.
That would fit with an early theory of mine: if we shut it down before it captures a frame, <something> isn't initialized.
Folks, can someone give me a testcase that captures the camera and does something with it?  Even one that just displays the camera output in the box.

That should help me figure out how initialization normally takes place.

Also it'd be nice to have a capture-the-camera-and-displays-its-output testcase that works in other browsers, like Safari and Chrome.  That way I'd be able to observe any tricks that those guys play to get things working.  (I hope and assume that Safari, at least, would also use QTKit to do this.)
> Even one that just displays the camera output in the box.

Even one that just displays the camera output in *a* box.
http://mozilla.github.com/webrtc-landing/gum_test.html

Chrome also uses QTKit. See src/media/video/capture/mac/video_capture_device_qtkit_mac.mm
Created attachment 719739 [details]
Another wierd workaround, definitely *not* a fix

I keep digging around in the bowels of OS X, and I think I've found the key to this bug (at least on OS X 10.7 and 10.8) -- undocumented methods in an undocumented framework (the CoreMediaIO framework).

Here's the source for an interpose library that hooks both of these methods, and prevents one of them (CMIOUninitializeGraph()) from actually doing anything.  I'd like the people who can reproduce this bug's crashes (on OS X 10.7 and 10.8) to try it and see if it stops the crashes from happening.

If my hunch is right, the root cause of this bug's crashes (on 10.7 and 10.8) is that CMIOUninitializeGraph() is sometimes getting called "too early".

To test with this interpose library, you first need to compile it (just run make).  Then set the DYLD_INSERT_LIBRARIES environment variable as follows at a Terminal prompt:

export DYLD_INSERT_LIBRARIES=[full/path/to/]interpose.dylib

Then run your test build from the same Terminal prompt.
that dylib interpose took me from 100% crashing on the second "Crash" click to not being able to crash it!

I think we're on the right track!
Glad to hear it!

Tomorrow I'll see if I can turn my hunch into a real patch.
Hmm...I don't think a crashtest will help here. This is happening way low at the OS integration layer, which fake streams don't touch. So we probably can't a crashtest here.
Flags: in-testsuite? → in-testsuite-
Created attachment 720012 [details] [diff] [review]
Fix (partial, not yet for autorelease pool errors)

For a bug that's been so much trouble to figure out, this has an absurdly easy fix :-)

Please try out this patch and let us know your results.  It's worked for me in my tests (getting rid of the crashes on OS X 10.6.8 and the initialization errors on OS X 10.7.5).

I'll await your results before posting a patch that also fixes the autoreleased with no pool errors.
Assignee: rjesup → smichaud
Status: REOPENED → NEW
Created attachment 720017 [details]
Interpose library for logging

Here's another interpose library, this time just for logging.  On OS X 10.7 and 10.8, it shows CMIOUninitializeGraph() being called "too early" (on a secondary thread) in builds without my fix, and being called "on time" in builds with my fix.

See the instructions from comment #68 on how to use it.
Given that this bug is caused by a lack of initialization, and not (apparently from my results in comment #63) from accessing deleted objects, I doubt very much that it has any security implications.
Created attachment 720070 [details] [diff] [review]
Full fix (including for autorelease pool errors)

I decided not to wait for your tests ... but I'd still like to hear their results.

I'm not sure who to ask for a review.  If you don't feel comfortable reviewing this, Randell, please let me know.
Attachment #720012 - Attachment is obsolete: true
Attachment #720070 - Flags: review?(rjesup)
Comment on attachment 720070 [details] [diff] [review]
Full fix (including for autorelease pool errors)

Review of attachment 720070 [details] [diff] [review]:
-----------------------------------------------------------------

Releasing on the main thread matches some things I found in my investigations on the web.  (None of which were definitive unfortunately).

I don't feel I know enough about the qtkit/ObjC stuff to comment on the pool changes.  Are they needed?  If so, can you get a Mac person to look at them?

r+ for the proxy to main-thread.  (We had to do this already for handling close of the top-level webrtc engine on Mac in shutdown, because it internally proxied to the native Main Event loop, but in shutdown we'd stopped processing it.)

::: media/webrtc/trunk/webrtc/modules/video_capture/mac/qtkit/video_capture_qtkit.mm
@@ +54,5 @@
>  
>  VideoCaptureMacQTKit::~VideoCaptureMacQTKit()
>  {
>  
> +    nsAutoreleasePool localPool;

I'm guessing these take the place of the commented-out NSAutoreleasePools in the *_objc.mm files?

::: media/webrtc/trunk/webrtc/modules/video_capture/mac/qtkit/video_capture_qtkit_info_objc.h
@@ +23,5 @@
>  
>  @interface VideoCaptureMacQTKitInfoObjC : NSObject{
>      bool                                _OSSupportedInfo;
>      NSArray*                            _captureDevicesInfo;
> +    //NSAutoreleasePool*                    _poolInfo;

What's the point of leaving these commented out?

::: media/webrtc/trunk/webrtc/modules/video_capture/mac/qtkit/video_capture_qtkit_info_objc.mm
@@ +138,5 @@
>      {
>          return [NSNumber numberWithInt:0];
>      }
>  
> +    //_poolInfo = [[NSAutoreleasePool alloc]init];

What's the point of leaving these commented out?
Attachment #720070 - Flags: review?(rjesup) → review+
Comment on attachment 720070 [details] [diff] [review]
Full fix (including for autorelease pool errors)

> I don't feel I know enough about the qtkit/ObjC stuff to comment on
> the pool changes.

> Are they needed?

Yes.  Though the problems they fix are unrelated to the crashes.

> If so, can you get a Mac person to look at them?

Will do (on Monday).

> +    nsAutoreleasePool localPool;
>
> I'm guessing these take the place of the commented-out
> NSAutoreleasePools in the *_objc.mm files?

In a way.  They do correctly what the commented out stuff did
incorrectly.

> What's the point of leaving these commented out?

No point, really.  They're completely useless.  I'll get rid of them.
Randell, I assume you did test my patch, and that it did get rid of your crashes :-)
Oh, you wanted a test as well as a review?  ;-)  Yes, it seems to reliably solve my crashes, and I crashed 100% of the time before.
Thanks for the info.  Now I feel better.

Remember that my first test patch worked perfectly for me (on two different machines and OS versions), but not for you.
Created attachment 720707 [details] [diff] [review]
Full fix (get rid of _poolInfo instead of commenting it out)
Attachment #720070 - Attachment is obsolete: true
Attachment #720707 - Flags: review?
Comment on attachment 720707 [details] [diff] [review]
Full fix (get rid of _poolInfo instead of commenting it out)

Carrying forward rjesup's r+.
Attachment #720707 - Flags: review? → review+
We need this fixed ASAP so it can be uplifted for Beta 3 hopefully.  Otherwise I need to drop stuff and try to finish a working wallpaper patch (probably "wait until a frame comes in or X seconds before stopping").  Can you find a reviewer for the rest?  Ask on #developers?  Thanks!
Alternatively, if that's tough or if you believe the pool bug fixed here is unrelated to the reported problems here and/or minor, we could split this into two bugs, and only uplift the close one.
(Assignee)

Updated

6 years ago
Attachment #720707 - Flags: review?(bgirard)
(I'd actually already asked Benoit Girard to review in comment #81, but for some reason it didn't stick.)
Comment on attachment 720707 [details] [diff] [review]
Full fix (get rid of _poolInfo instead of commenting it out)

Review of attachment 720707 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/trunk/webrtc/modules/video_capture/mac/qtkit/video_capture_qtkit_info.mm
@@ +13,5 @@
>  #import "video_capture_qtkit_info_objc.h"
>  
>  #include "video_capture.h"
>  
> +class nsAutoreleasePool {

Why can't we reusage an existing copy such as http://mxr.mozilla.org/mozilla-central/source/widget/cocoa/nsCocoaUtils.h#43 ? This pattern is everywhere. I don't think it impacts our release code but it bloats our symbols/debug info.
Attachment #720707 - Flags: review?(bgirard) → review+
> Why can't we reusage an existing copy such as
> http://mxr.mozilla.org/mozilla-central/source/widget/cocoa/nsCocoaUtils.h

The WebRTC stuff is in a separate executable, so I didn't think to try
to link in stuff from the rest of the tree.  But I'd forgotten that
we'd implemented nsAutoreleasePool in a header file, which might work.
I'll try that, and land it if it works.

Otherwise we eventually want to create an equivalent of nsCocoaUtils.h
for the WebRTC stuff, which I suspect needs a *lot* of cleanup.
However I don't want to do that now.
Landed on mozilla-inbound:
http://hg.mozilla.org/integration/mozilla-inbound/rev/9fa5887d41e3

The WebRTC stuff has a completely different build system than Firefox does (more like Chromium's).  So I quickly gave up trying to figure out how to include a header file from elsewhere in the Mozilla tree.

I also discovered (and removed) one more useless NSAutoreleasePool* in video_capture_qtkit_objc.h and video_capture_qtkit_objc.mm.

The autorelease pool errors are harmless and unrelated, but they are also very visible and annoying.  They fill up the system console with useless error messages, and make it a lot more difficult to use the system console for debugging.
Created attachment 720758 [details] [diff] [review]
Full fix (what I actually landed)

Carrying forward a single r+ for rjesup's and bgirard's r+es.
Attachment #720707 - Attachment is obsolete: true
Attachment #720758 - Flags: review+
(In reply to Steven Michaud from comment #88)
> The WebRTC stuff has a completely different build system than Firefox does
> (more like Chromium's).  So I quickly gave up trying to figure out how to
> include a header file from elsewhere in the Mozilla tree.

Alright not a problem
(In reply to Steven Michaud from comment #88)
> Landed on mozilla-inbound:
> http://hg.mozilla.org/integration/mozilla-inbound/rev/9fa5887d41e3

Randell, can you please file an issue/patch upstream?
https://hg.mozilla.org/mozilla-central/rev/9fa5887d41e3
Status: NEW → RESOLVED
Last Resolved: 6 years ago6 years ago
status-firefox22: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
Randell had a solid way to crash this and verified this crash with this patch. So I don't think this needs QA testing here.
Whiteboard: [getUserMedia][blocking-gum+][webrtc-uplift] → [getUserMedia][blocking-gum+][webrtc-uplift][qa-]
(In reply to Jason Smith [:jsmith] from comment #94)
> Randell had a solid way to crash this and verified this crash with this
> patch. So I don't think this needs QA testing here.

verified this crash no longer reproduces*
Comment on attachment 720758 [details] [diff] [review]
Full fix (what I actually landed)

[Approval Request Comment]
Bug caused by (feature/regressing bug #): original import

User impact if declined: Crashes on some macs when you have fast opens/closes.  It's possible there's a (very?) tough to exploit security vulnerability here given a few of the stacks had random addresses logged.

Testing completed (on m-c, etc.): On m-c.  Verified by myself (was 100% reliable crash) and Jason.  cdiehl may be able to verify as well.


Risk to taking this patch (and alternatives if risky): Low - a typical "proxy-to-native-mainthread" plus some incorrect pool cleanups.  Very unlikely to be worse than the current state.  Likely safer than a wallpaper patch, very likely that it's no more risky (wallpaper patch would touch a lot more code and would be harder to verify). 


String or UUID changes made by this patch: none.
Attachment #720758 - Flags: approval-mozilla-beta?
Attachment #720758 - Flags: approval-mozilla-aurora?
Comment on attachment 720758 [details] [diff] [review]
Full fix (what I actually landed)

Given the sec rating, that this has already landed to central and that we're about to go to build on FF20 beta 3, approving for uplift.  Please land today so this gets into beta 3.
Attachment #720758 - Flags: approval-mozilla-beta?
Attachment #720758 - Flags: approval-mozilla-beta+
Attachment #720758 - Flags: approval-mozilla-aurora?
Attachment #720758 - Flags: approval-mozilla-aurora+
status-b2g18: --- → unaffected
status-firefox-esr17: --- → unaffected
Whiteboard: [getUserMedia][blocking-gum+][webrtc-uplift][qa-] → [getUserMedia][blocking-gum+][webrtc-uplift][qa-][adv-main20-]
Whiteboard: [getUserMedia][blocking-gum+][webrtc-uplift][qa-][adv-main20-] → [getUserMedia][blocking-gum+][qa-][adv-main20-]
Group: core-security
You need to log in before you can comment on or make changes to this bug.