Closed Bug 1164480 Opened 5 years ago Closed 5 years ago

Crashes on startup in AppleVDADecoder::CreateOutputConfiguration() in content process

Categories

(Core :: Audio/Video, defect, critical)

All
macOS
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla41
Tracking Status
firefox39 + verified
firefox40 + fixed
firefox41 + verified

People

(Reporter: smichaud, Assigned: smichaud)

References

Details

(Keywords: regression, reproducible, topcrash-mac, Whiteboard: [STR in comment #17] )

Crash Data

Attachments

(3 files, 1 obsolete file)

For example bp-5132843b-d0dc-4398-bfe0-c6ad82150512.

These started with the 20150509030210 mozilla-central nightly.

They are similar to (and may be related to) the crashes at 1153809.
Bug 1153809 turned out to be a problem with the content process sandbox.  So, probably, will these.  They only happen on OS X 10.9 and up (and the content process sandbox is only enabled (only has a non-empty ruleset) on 10.9 and up).

These are a Mac topcrasher on the 41 branch.
Crash Signature: [@ CoreFoundation@0x117d8 ] [@ CoreFoundation@0x11a5b ]
Keywords: topcrash-mac
Keywords: regression
Implied regression range:

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=86203ac87a08&tochange=39dc888ce14c

Nothing here jumps out at me.  And in any case it'd (probably) be the trigger, not the cause.
These appear to all be crashes on startup.  The most common URL is about:blank.
Summary: Crashes in AppleVDADecoder::CreateOutputConfiguration() in content process → Crashes on startup in AppleVDADecoder::CreateOutputConfiguration() in content process
Jya, can you identify the trigger in the implied regression range from comment #2?

If we knew it, it might be easier to come up with STR.  (And of course there's little we can do here without STR.)
Flags: needinfo?(jyavenard)
These crashes may turn out to be hardware-specific (as were those for bug 1153809).

There's a strong correlation with Intel HD Graphics 4000 hardware (vendor 0x8086, device 0x0166).  But it's by no means the only kind of graphics hardware on which these crashes happen.  And Intel isn't the only vendor.

Here are some others:

Vendor   Adapter ID

Intel    0x0d26
Intel    0x0a26
Intel    0x0116

https://crash-stats.mozilla.com/search/?signature=~CoreFoundation%400x117d8&version=40.0a1&_facets=signature&_facets=adapter_vendor_id&_facets=adapter_device_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-adapter_vendor_id
My (Retina) MacBookPro10,1 has Intel HD Graphics 4000 hardware!  Later I'll try to reproduce these crashes ... or at least see if I get any sandbox errors.
I'm unable to reproduce these crashes, on either OS X 10.10.3 or 10.9.5.  I also don't see any video-hardware-specific sandbox errors (in the system console).

I tested yesterday's m-c nightly with a fresh profile.  I tested with and without gfxCardStatus (https://gfx.io/, which I used to force the nightly to always use the "integrated" Intel HD Graphics 4000 hardware on my MacBookPro10,1).

So, Jya, please do see if you can identify this bug's trigger.  Without that we may never be able to find STR.
I took another look at our crash stacks and noticed that mozilla::CreateTestH264Decoder was added by the patch for bug 1160321, which is in the implied regression range from comment #2.  So that's the most likely trigger.

Cpearce, do you have any idea how that patch could have caused/triggered these crashes?
Blocks: 1160321
Flags: needinfo?(cpearce)
http://hg.mozilla.org/mozilla-central/rev/03dc784d5456

 Bug 1160321 - Test whether we can create H.264/AAC decoders before we report we support them. r=mattwoodrow
author	Chris Pearce <cpearce@mozilla.com>
	Fri, 08 May 2015 13:36:32 +1200
Yes, this is the one I was going to point to. 

It would be interesting to see if applying the patches from bug 1163458 fix the problem, in particular path #2.

Currently, the apple PlatformDecoderModule constantly link/unlink the framework (the last decoder when destroyed unlink the framework).
Bug 1160321 causes the creation / deletion of a dummy PDM to ensure that we can decode video.

A quick fix possibility would be to make the code introduced by bug 1160321 conditional, as i can't foresee a reason creating a PDM wouldn't work on any mac we support (>= 10.6)
Jean-Yves
Flags: needinfo?(jyavenard)
[Tracking Requested - why for this release]:

This is a Mac topcrasher.  Currently only the 41 branch (currently trunk) is affected, because e10s mode is only available on the trunk.  But if we allow e10s mode to be turned on in the Developer Edition, the 40 branch would also be affected.
This JS code should trigger CreateTestH264Decoder:

document.createElement("video").canPlayType("video/mp4; codecs=\"avc1.42E01E, mp4a.40.2\"");

Following up on what jya was saying; we create a PlatformDecoderModule in CreateTestH264Decoder(), and use that to create a MediaDataDecoder, which we return. However we'll destroy the PDM when we exit CreateTesth264Decoder, which unlinks the system video decoding libraries, so when the caller uses the returned MediaDataDecoder we could end up calling into code in the system libraries that has been removed from our process space. So we may need to return the PDM as well as the MediaDataDecoder to ensure it doesn't die.
Flags: needinfo?(cpearce)
Steven: Try this patch, it may help.
Attachment #8605606 - Flags: feedback?(smichaud)
Tracking enabled for 40 and 41.
(In reply to comment #13)

I'm not (yet) able to reproduce these crashes, so there's no point (yet) in trying your fix, Chris.

I've noticed that CreateTestH264Decoder doesn't normally get called on startup (where these crashes happen).  I'll try to figure out why that's happening.  The code snippet from comment #12 doesn't crash for me when called at "normal" times (on my MacBookPro10,1, on OS X 10.10.3).

In that regard it's odd that so many of the crash URLs are "about:blank".  You'd think the only way to trigger a call to CreateTestH264Decoder is to load a web page.  Does anyone know if there exceptions to this (presumed) rule?
Putting the code snippet from comment #12 in a JavaScript onload handler and reloading the testcase on startup doesn't trigger these crashes.  But, in any case, doing that doesn't call CreateTestH264Decoder the way it's called in this bug's crash stacks.  That kind of testcase produces a call to CreateTestH264Decoder (indirectly) from nsDocShell::EndPageLoad().  But none of this bug's crash stacks have that call in them.

Instead, what all this bug's crash stacks have is a call to nsHtml5TreeOpExecutor::RunFlushLoop(), called (indirectly) from nsTimerImpl::Fire().

Anyone know what that means?  Anyone know how to trigger a call to CreateTestH264Decoder (indirectly) from nsHtml5TreeOpExecutor::RunFlushLoop()?
Hooray, I've got STR for this bug!

1) Set a current m-c nightly to "show my windows and tabs from last time" "when Nightly starts".
2) Visit http://rss.slashdot.org/~r/Slashdot/slashdot/~3/GVTelbFuSos/the-decline-of-pixel-art
3) Quit Nightly and restart.

I got the URL from crash stats.  Interestingly my own crash reports all show the URL as "about:blank" or "about:home".  That's a bug, I suppose.

bp-51fcefe2-bf23-4435-a492-d4aeb2150514
bp-89feb3c0-82d1-4db2-bae1-64d652150514
bp-e00f9876-8a8e-4be9-b3dc-2e5062150514
bp-6e8549ac-88f8-4a4f-91e3-7fc2c2150514
Keywords: reproducible
Whiteboard: [STR in comment #17]
> bp-51fcefe2-bf23-4435-a492-d4aeb2150514
> bp-89feb3c0-82d1-4db2-bae1-64d652150514
> bp-e00f9876-8a8e-4be9-b3dc-2e5062150514
> bp-6e8549ac-88f8-4a4f-91e3-7fc2c2150514

The interpose.dylib in some of these is my own interpose library that I was using for debug logging.  You won't see it in other crash logs ... and no, it doesn't trigger the crashes :-)
My STR doesn't work with e10s off, even though (in that case) CreateTestH264Decoder is still called (indirectly) from nsHtml5TreeOpExecutor::RunFlushLoop().
The content process sandbox isn't involved here.  Setting security.sandbox.content.level doesn't make these crashes go away.
> 2) Visit http://rss.slashdot.org/~r/Slashdot/slashdot/~3/GVTelbFuSos/the-decline-of-pixel-art

You need to do this in the current tab, or make it current before you quit and restart.
My STR from comment #17 "works" (I crash) on all supported versions of OS X (back to 10.6.8), and on all video hardware on which I tried it.  So this bug is not OS-X-version-specific or hardware-specific.
Comment on attachment 8605606 [details] [diff] [review]
Patch: potential fix

This patch doesn't help.  With it my STR from comment #17 still crashes.

My guess is that some kind of necessary initialization doesn't happen, in e10s mode, before the first call to CreateTestH264Decoder (via nsHtml5TreeOpExecutor::RunFlushLoop(), on startup).  In any case, it does seem that the call to CreateTestH264Decoder happens "too early" in e10s mode (in the content process).
Attachment #8605606 - Flags: feedback?(smichaud) → feedback-
> Interestingly my own crash reports all show the URL as "about:blank"
> or "about:home".  That's a bug, I suppose.

I've opened bug 1165017 on this issue.
I wonder if this bug and bug 1165017 are related -- if both bugs aren't caused by some kind of global initialization taking place "too late" in the content process.
I no longer think bug 1165017 is related -- it seems characteristic of almost all content process crashes, on all platforms.
I was able to reproduce this on my 2015 MBP.

I am going to #ifdef the code that causes this out on everything except Windows in bug 1163814, since it causes problems on Android too. At this stage I think we only really need this code on Windows, so Steven I don't think you need to spend more time trying to make it work on MacOSX.

I have a patch in bug 1163814 going through Try. I will hopefully be able to land it later on tonight if the tree re-opens, but Steven you're welcome to land it for me tomorrow (which is Saturday for me) if I don't get to it tonight.
Attached patch Fix (obsolete) — Splinter Review
I couldn't give up on this, after already having spent so much time on it :-)

The fix is *very* simple.  The problem is that (with this bug's STR) MacIOSurfaceLib::kPropIsGlobal (a CFStringRef) is getting referenced before it's been initialized (before MacIOSurfaceLib::LoadLibrary() has been called and the IOSurface framework has been dlopened).  My patch just uses a copy of the CFStringRef to which MacIOSurfaceLib::kPropIsGlobal will eventually point.
Assignee: nobody → smichaud
Attachment #8606531 - Flags: review?(jyavenard)
I have to say, I'm always super impressed on your ability to reproduce a bug and find a fix. This is very impressive!

how can MacIOSurfaceLib::LoadLibrary() not be called earlier though? Shouldn't we make the reference to kPropIsGlobal a function getter instead that would load the library as required ?
Comment on attachment 8606531 [details] [diff] [review]
Fix

Review of attachment 8606531 [details] [diff] [review]:
-----------------------------------------------------------------

A bit unfortunate that work around. We did lots of work to avoid copying those CFSTR and use the system's one.
Attachment #8606531 - Flags: review?(jyavenard) → review+
> I have to say, I'm always super impressed on your ability to
> reproduce a bug and find a fix. This is very impressive!

Thanks! :-)

> Shouldn't we make the reference to kPropIsGlobal a function getter
> instead that would load the library as required?

I gave some thought to calling MacIOSurfaceLib::LoadLibrary() at the
beginning of the function.  If we were going to do this "properly", I
think that'd be better than writing a getter for kPropIsGlobal.

But I'd want to look at the context before committing to this.  In
particular I'd want to find out if there is some reason we don't call
MacIOSurfaceLib::LoadLibrary() as early in the content process as we
do in the main process.

The patch I posted has the advantage of being very simple, though, and
almost certainly very safe.  Apple has long followed the pattern of
making its CFStringRef variable names match their values -- so, for
example, the value of kCFBooleanTrue is CFSTR("kCFBooleanTrue"), and
this is exactly what you'd expect.

Let me look further into this on Monday.
> In particular I'd want to find out if there is some reason we don't
> call MacIOSurfaceLib::LoadLibrary() as early in the content process
> as we do in the main process.

As best I can tell it's just an accident:  For some reason the calls
to nsHtml5TreeOpExecutor::RunFlushLoop() start much earlier (as
Firefox is being started up) in the content process than in the main
process.  Since calls to RunFlushLoop() are supposed to be blocked
while scripts are running, my guess is that in the main process
they're blocked until fairly late by Chrome-priviledged scripts that
run on startup.

So in the main process the MacIOSurfaceLib libraries are never
dlopened from calls to RunFlushLoop() (by the time calls to
RunFlushLoop() start, these librararies have always already been
dlopened by other means).  I don't *know* that it's harmful to do this
from RunFlushLoop(), but my gut tells me it's probably safer not to.

So I'm going to go with my original fix ... once mozilla-inbound
reopens.
Attached patch Fix (unbitrot)Splinter Review
Oddly, the file I patched seems to have moved very recently to elsewhere in the tree.  So I had to unbitrot my patch before landing.

Carrying forward r+.
Attachment #8606531 - Attachment is obsolete: true
Attachment #8607300 - Flags: review+
the PlatformDecoderModules have moved to their own directory (dom/media/platforms)
https://hg.mozilla.org/mozilla-central/rev/d9b6cc4ce4cc
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
I don't know how my patch for this bug could have caused bug 1166394, but it seems to have done so.

I'm going to back it out on mozilla-inbound as soon as I can.
This bug is now REOPENED.  But it's much less urgent than before, since the crashes have been otherwise taken care of by bug 1163814.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [STR in comment #17] → [STR in comment #17] [keep open]
> This has landed on mozilla-central:

The backout has landed on mozilla-central.
Alternative fix, just makes sure MacIOSurfaceLib is loaded early in startup.
Attachment #8608445 - Flags: review?(jyavenard)
Comment on attachment 8608445 [details] [diff] [review]
Load MacIOSurfaceLib early

Review of attachment 8608445 [details] [diff] [review]:
-----------------------------------------------------------------

I'm really not the best person to review this and I don't know the big picture on when that code is called. But if that works :)
Attachment #8608445 - Flags: review?(jyavenard) → review+
I've looked around for a way to repair my original patch, but can't find one.  So it looks like we'll need to keep the reference to the kIOSurfaceIsGlobal symbol (a CFStringRef) in the IOSurface framework.  So Matt's approach (if it works as expected) is probably best.

At first (after having recovered from being stunned by the bug 1166394 failures and others), I thought there might be confusion about exactly what kind of indirection we need (for example that we might need a CFStringRef* instead of a CFStringRef, or that we can't use the results of CFSTR() directly).  All my variants on using CFSTR() directly or indirectly (or of using other ways to create CFStringRefs) still failed in the same way.

Then I noticed that the IOSurface we're creating needs to be passed between processes (bug 1064847 comment #14), and that the failures only occur in e10s mode -- only when the IOSurface actually is being passed between processes.  So somehow, for kIOSurfaceIsGlobal we *must* use the symbol in the IOSurface framework.  Somehow the OS can flatten/serialize that when passing the IOSurface between processes, but not just any old CFStringRef.

At some point I'd like to get to the bottom of this.  But right now I don't have the time.  And in any case this bug is no longer urgent.
For future reference, here are two IOKit methods that might be useful in getting to the bottom of the problem described in comment #45:

CFDataRef IOCFSerialize(CFTypeRef object, CFOptionFlags options);
CFTypeRef IOCFUnserialize(const char *buffer,
                          CFAllocatorRef allocator,
                          CFOptionFlags options,
                          CFStringRef *errorString);

http://opensource.apple.com/source/IOKitUser/IOKitUser-1050.1.21/
(In reply to Steven Michaud from comment #45)
> I've looked around for a way to repair my original patch, but can't find
> one.  So it looks like we'll need to keep the reference to the
> kIOSurfaceIsGlobal symbol (a CFStringRef) in the IOSurface framework.  So
> Matt's approach (if it works as expected) is probably best.

I think the issue isn't just about having that symbol not defined.
CreateOutputConfiguration() just creates the dictionaries. But the callers of CreateOutputConfiguration() in particular VDADecoderCreate() or VTDecompressionSessionCreate() themselves would likely refer to the IOSurface framework internally.

So I don't think we could solve the problem just by copying the key.
For CFStringRefs, it isn't supposed to matter if you have two separate copies or two references to one original.  CFEqual is supposed to return 'true' for two different but lexically identical CFStringRefs, and CFHash is supposed to return the same value for each.

So I still do think serialization/flattening is involved here ... but it would take a long time to figure out exactly how.
For those who are curious, I'd start my investigation by hooking calls to IOCFSerialize and IOCFUnserialize, particularly from the IOSurface framework, using http://people.mozilla.org/~stmichaud/ReverseEngineering/InterposeLibraryTemplate/.
(In reply to Steven Michaud from comment #49)
> For CFStringRefs, it isn't supposed to matter if you have two separate
> copies or two references to one original.  CFEqual is supposed to return
> 'true' for two different but lexically identical CFStringRefs, and CFHash is
> supposed to return the same value for each.

this is not what I had in mind.
Not that VDA or VideoToolbox would use references to that particular CFStringRef (as you said, it doesn't matter), but that it will likely use function from the IOSurface framework, framework that hasn't been loaded yet.
Whiteboard: [STR in comment #17] [keep open] → [STR in comment #17]
https://hg.mozilla.org/mozilla-central/rev/5d6c5b470d38
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
Matt, your patch has been on m-c for about a week, without causing any problems (at least that I'm aware of).  Could you request uplift to aurora (the 40 branch)?
Flags: needinfo?(matt.woodrow)
Comment on attachment 8608445 [details] [diff] [review]
Load MacIOSurfaceLib early

Approval Request Comment
[Feature/regressing bug #]: Bug 1160321 
[User impact if declined]: Startup crashes
[Describe test coverage new/current, TreeHerder]: Manually tested, few weeks on m-c
[Risks and why]: Low risk
[String/UUID change made/needed]: None.
Flags: needinfo?(matt.woodrow)
Attachment #8608445 - Flags: approval-mozilla-aurora?
Comment on attachment 8608445 [details] [diff] [review]
Load MacIOSurfaceLib early

Thanks. Taking it to remove a startup crash.
Attachment #8608445 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
It looks like Bug 1160321 was uplifted to 39. Should this go along with it?  Thanks.
Flags: needinfo?(matt.woodrow)
Comment on attachment 8608445 [details] [diff] [review]
Load MacIOSurfaceLib early

Approval Request Comment
[Feature/regressing bug #]: Bug 1160321 
[User impact if declined]: Startup crashes
[Describe test coverage new/current, TreeHerder]: Manually tested, few weeks on m-c
[Risks and why]: Low risk
[String/UUID change made/needed]: None.
Flags: needinfo?(matt.woodrow)
Attachment #8608445 - Flags: approval-mozilla-beta?
Comment on attachment 8608445 [details] [diff] [review]
Load MacIOSurfaceLib early

Approved for uplift to beta.
Attachment #8608445 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: qe-verify+
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.