Closed Bug 932678 Opened 11 years ago Closed 10 years ago

[10.9] "Butterfly Demo" "hangs" then crashes the Unity plugin (after SIGSEGV at Mono:GC_mark_from + 1004 in plugin code)

Categories

(Core Graveyard :: Plug-ins, defect, P3)

x86
macOS
defect

Tracking

(firefox25 affected, firefox26 affected, firefox27 affected, firefox28 affected)

RESOLVED WORKSFORME

People

(Reporter: cpeterson, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: hang, reproducible, thirdparty, Whiteboard: [summary in comment #34])

Crash Data

Attachments

(3 files)

STR:
1. Load http://unity3d.com/gallery/demos/live-demos
2. Play any of the Unity demos *except* "Butterfly Demo"
3. See the demo play as expected
4. Play the Butterfly Demo

RESULT:
After the progress bar reaches 100%, Firefox hangs and stderr logs the following messages (without actually logging the supposed stack trace):

  plugin-container[68266] <Error>: CGImageCreateWithImageProvider: invalid image size: 0 x 0.
  plugin-container[68266] <Error>: CGImageCreateWithImageProvider: invalid image size: 0 x 0.
  Stacktrace:

Both Chrome 30 and Safari 7 fail to load the Butterfly Demo, but those browsers do not hang. I am using Unity Player v4.2.2f1.
> <Error>: CGImageCreateWithImageProvider: invalid image size: 0 x 0.

These errors (or ones very much like them) go way back:  See bug 409452 and bug 400865.  Bug 409452 also mentions lots of CPU being eaten.

I suspect this is an Apple bug of some kind -- an inappropriate reaction to an error condition.  But if this is a 100% reproducible testcase we may be able to find a workaround.
See Also: → 409452, 400865
What version of OS X did you test on?
I can reproduce this hang on two different MacBook Pros (Retina and non-Retina), both running OS X 10.9.0.
Just attaching to the hung Firefox and getting a stack would be good (plugin-container also). Unless this is a 100% CPU hang, in which case a profile via Instruments might be better. In any case, not high priority.
Priority: -- → P3
The Butterfly demo works just fine (for me) in FF 24 on OS X 10.8.5 with the same version of the Unity plugin (which is the current version).  This is despite my seeing your error message three times.  I'll keep testing.

> Just attaching to the hung Firefox and getting a stack would be good (plugin-container also).

Only do this with a mozilla-central nightly (they don't have their symbols stripped).  Otherwise the FF-specific symbols in your stack will be all wrong.  And it's probably best to get an all-thread stack trace (thread apply all bt).
Or send them SIGABRT (in Fx27+ builds) to trigger the crash reporter and get the stacks that way. https://developer.mozilla.org/en-US/docs/How_to_Report_a_Hung_Firefox
Note that Apple's latest Xcode command-line tools (for Mavericks) deliberately don't include gdb.  So you may need to use lldb instead.
I see your hang on OS X 10.9 (though I don't see high CPU usage).  Also, after about 30 seconds, I get an error page telling me that the Unity Player plugin has crashed.  This is in FF 25.

So this is presumably 10.9-specific, and is very likely to be an Apple bug.
Summary: Unity plugin "Butterfly Demo" hangs Firefox → [10.9] Unity plugin "Butterfly Demo" hangs Firefox
Summary: [10.9] Unity plugin "Butterfly Demo" hangs Firefox → [10.9] "Butterfly Demo" "hangs" then crashes the Unity plugin
(Following up comment #8)

bp-0ecc6db0-44e1-45c7-92d1-c3e0e2131030

Does this mean anything to you, Benjamin? :-)
(Following up comment #10)

These all happened on OS X 10.9, but I'm not sure they're all related.

Most of them concern the Unity plugin, and presumably *are* related.  But there are also a few for the Silverlight and DivX plugins.
(Following up comment #8 and comment #10)

I get the same thing with today's mozilla-central nightly.  But then, after I submit the report (and it appears in my about:crashes list), Socorro is unable to find it.

Note that none of the reports from comment #10 concern nightlies -- only releases (currently FF 24 and FF 25).
I used SIGABRT to trigger this crash report during the hang:

b3db3461-f563-43ea-8c81-c3f8c2131030
> b3db3461-f563-43ea-8c81-c3f8c2131030

bp-b3db3461-f563-43ea-8c81-c3f8c2131030

This crash is in the main process.  Is there another, associated crash report for the plugin process?
By the way, I'm trying to use lldb to find a stack trace of the code that displays the "invalid image size" error messages.  Haven't yet managed it.
Steven: I could crash the plugin-container with SIGABRT, but Firefox did not upload a crash report. I attached gdb to the plugin-container and dumped all the thread stack traces in the attached file.
> plugin-container-threads.txt

Unfortunately I don't see anything interesting there (or in the main process stack trace, for that matter).  Both processes, though clearly in the middle of some kind of IPC communication, are (as best I can tell) doing "normal" waiting on all threads.

Benjamin may be able to glean more out of them.  But at this point I think our best hope is to figure out what code is causing the error messages to be displayed.  By fiddling with that we may be able to work around this bug (which like I said is probably an Apple bug).
Does this happen in FF26?

The child is sending a sync message (PPluginInstanceChild::SendShow).
The parent is sending an RPC message (PPluginInstanceParent::CallPBrowserStreamConstructor).

This is inherently racy, and that's ok because the IPC mechanism should resolve the race by having the sync message (SendShow) win.

Does this happen also in FF26?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #18)
> Does this happen in FF26?

Yes. I can reproduce the hang in Firefox 25, 26, 27, and 28.
Here's a trace (all threads) of all three "invalid image size: 0 x 0" errors, plus a call to "Mono`GC_mark_from" that I don't fully understand.  (It's the undocumented CGPostError() method that displays the errors.)

All these calls are made from plugin code -- over which (of course) we don't have direct control.

But I'm going to devise an interpose library that hooks CGImageCreateWithImageProvider and stops it from being called with a zero-sized image.  If this stops these crashes, we'll have proved that this is an Apple bug, and have shown how we and others can work around it.
I should have mentioned that this trace was made using today's mozilla-central nightly.
(In reply to Steven Michaud from comment #20)
> Here's a trace (all threads) of all three "invalid image size: 0 x 0"
> errors, plus a call to "Mono`GC_mark_from" that I don't fully understand.

Unity uses the Mono .NET runtime for its scripting language.
> But at this point I think our best hope is to figure out what code
> is causing the error messages to be displayed.  By fiddling with
> that we may be able to work around this bug (which like I said is
> probably an Apple bug).

This turned out to be a red herring.

The "invalid image size: 0 x 0" error messages also get logged by the
other demos (which don't crash), and also appear on other versions of
OS X.  And avoiding them by hooking doesn't get rid of the crashes,
either.  (I ended up hooking CGImageCreateWithImageInRect(), which in
Unity plugin code sometimes gets called with rect.size.width or
rect.size.height set to '0'.  Avoiding these calls did stop the error
messages being displayed, but didn't stop the "crashes".)

Next I'm going to try to get a stack trace of whatever code logs the
"Stacktrace:" message.
> Next I'm going to try to get a stack trace of whatever code logs the
> "Stacktrace:" message.

I've found that this is logged from Mono code (the Mono bundle in the
Unity plugin) using a call to fwrite().  But the code from which this
happens (mono_handle_native_sigsegv) is a signal handler, which is the
only thing on the stack whenever it's called (always on the main
thread).  After the "Stacktrace:" message is logged, the
mono_handle_native_sigsegv method calls mono_jit_walk_stack_from_ctx,
but apparently this fails.

So it's pretty clear a crash is happening somewhere in Mono code,
which it doesn't handle entirely properly.  But there's not much more
we can say.  We pretty much have to hand this off to the Unity
developers.

Chris, would you mind opening a bug with Unity?  When/if you do, you
should probably refer to this bug.
As for the "hang", here's what I suspect is going on:

The mono_handle_native_sigsegv signal handler stops the Unity plugin from crashing, but it's effectively dead from this point -- so it stops handling IPC messages.  Firefox (the main process) notices this and eventually kills the Unity plugin after a timeout.

Sometimes Firefox takes a long time to notice that the Unity plugin is dead -- apparently it can die when no messages are "expected" from it.  But I find you can trigger Firefox's countdown by doing something that should trigger IPC messages -- for example changing the browser window's or app's focus.
When I run Firefox from Terminal and Firefox does manage to kill the Unity plugin, I see error messages like the following in Terminal:

###!!! [Parent][MessageChannel::Call] Error: Channel timeout: cannot send/recv
Mono uses sigaction() to set signal handlers, so in principle I should be able to use an interpose library to hook these calls and prevent Mono from installing a handler for SIGSEGV.  But Mono somehow prevents this method (and also fwrite()) from being hookable using an interpose library.

But a Unity developer could do this directly in Mono code.  The advantage would be that, without the signal handler, one could see exactly where the SIGSEGV crash is happening.
Oops, comment #27 is wrong -- I *can* hook Mono's calls to sigaction.  Hopefully I'll be able to post a crash stack in a bit.
What does mono_handle_native_sigsegv actually *do* instead of crashing?

This is very low priority, I don't think you should spend much more time on it. It seems that the Firefox plugin hang detector works properly and kills the plugin after 60 seconds, so we have a backstop for users.
I filed a Unity support request for this hang, Case #00138536: "Unity hang bug on OS X 10.9?"
> What does mono_handle_native_sigsegv actually *do* instead of crashing?

I have no idea.

> Hopefully I'll be able to post a crash stack in a bit.

Apparently not.  Stopping "Mono" and "UnityPlayer" from using sigaction to set signal handlers for SIGSEGV does stop the "hang" from happening -- the plugin process does die immediately.  But lldb doesn't give me a stack trace (for the crashing plugin-container process) -- it just tells me that the process has exited with status = 0.

Possibly the process died because of a SIGKILL or SIGSTOP, but I'll leave that to the Unity developers to figure out.

By the way, Benjamin, if I hadn't done my analysis, we wouldn't have a clue what the problem is here.  At least now we have something we can reasonably pass to the Unity developers.
Actually I was doing the hooking wrong.  If you do it right you *do* get a trace of the crash (the SIGSEGV access violation) in lldb.  And in fact we've already seen it in attachment 824863 [details] above:

Process 926 stopped
* thread #1: tid = 0x6e80, 0x144d6905 Mono`GC_mark_from + 1004, queue = 'com.apple.main-thread, stop reason = signal SIGSEGV
    frame #0: 0x144d6905 Mono`GC_mark_from + 1004
Mono`GC_mark_from + 1004:
-> 0x144d6905:  movl   4(%eax), %esi
   0x144d6908:  cmpl   %esi, -172(%ebp)
   0x144d690e:  ja     0x144d691c                ; GC_mark_from + 1027
   0x144d6910:  cmpl   %esi, -176(%ebp)
(lldb) bt all
* thread #1: tid = 0x6e80, 0x144d6905 Mono`GC_mark_from + 1004, queue = 'com.apple.main-thread, stop reason = signal SIGSEGV
    frame #0: 0x144d6905 Mono`GC_mark_from + 1004
    frame #1: 0x144d796a Mono`GC_mark_some + 466
    frame #2: 0x144cf6b6 Mono`GC_stopped_mark + 470
    frame #3: 0x144cfa9a Mono`GC_try_to_collect_inner + 351
    frame #4: 0x144cfe0b Mono`GC_try_to_collect + 136
    frame #5: 0x144cfe62 Mono`GC_gcollect + 26

...

I didn't previously understand why lldb stopped at Mono`GC_mark_from.  So now we know that's where the access violation is.
For those few intrepid Mozilla developers who realize the value of reverse engineering, and are interested in learning more about it, here's the interpose library I used to hook Mono's calls to sigaction().

interpose.mm contains instructions on how to build it, and various other explanatory comments.
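
The attachment itself is not reproduced here. As a rough illustration of the idea only (not the contents of interpose.mm), a wrapper around sigaction() can refuse to install any SIGSEGV handler and pass every other request through. The name filtered_sigaction and the struct layout of the interpose entry are assumptions made for this sketch; the __DATA,__interpose section is the usual dyld interposing mechanism on OS X, and the library would be loaded with DYLD_INSERT_LIBRARIES. Unlike the hooking described above, this sketch does not check which module ("Mono" or "UnityPlayer") is making the call.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/* Hypothetical replacement for sigaction(): it refuses to install a
 * SIGSEGV handler and forwards every other request unchanged. */
int filtered_sigaction(int signum, const struct sigaction *act,
                       struct sigaction *oldact)
{
    if (signum == SIGSEGV && act != NULL) {
        /* Report the current disposition if the caller asked for it,
         * but leave SIGSEGV handling as it was. */
        if (oldact != NULL)
            return sigaction(signum, NULL, oldact);
        return 0; /* pretend the installation succeeded */
    }
    return sigaction(signum, act, oldact);
}

#ifdef __APPLE__
/* dyld interposing entry: calls to sigaction() made from other images are
 * redirected to filtered_sigaction(). Calls made from this library itself
 * are not interposed, so the wrapper still reaches the real function. */
__attribute__((used, section("__DATA,__interpose")))
static const struct {
    const void *replacement;
    const void *replacee;
} interpose_sigaction = {
    (const void *)(unsigned long)&filtered_sigaction,
    (const void *)(unsigned long)&sigaction
};
#endif
```

With the handler never installed, the SIGSEGV takes its default action, so a debugger attached to the plugin process stops at the faulting instruction instead of seeing the signal swallowed.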
Summary: [10.9] "Butterfly Demo" "hangs" then crashes the Unity plugin → [10.9] "Butterfly Demo" "hangs" then crashes the Unity plugin (after SIGSEGV at Mono:GC_mark_from + 1004 in plugin code)
So to sum up, here's what's happening:

1) An access violation takes place at Mono:GC_mark_from + 1004 in Unity plugin code.

2) A SIGSEGV handler, Mono:mono_handle_native_sigsegv (in plugin code), handles the signal without either properly working around the error or letting the plugin process die.

3) The plugin process is no longer able to handle IPC messages.  If/when the main process finds itself "expecting" one of these messages, it assumes the plugin process is hung and starts a timer.

4) Once the timer has expired, the main process kills the plugin process.
> it assumes the plugin process is hung and starts a timer.

It assumes the plugin process might be hung and starts a timer.
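
The sequence summarized above can be modeled in a few lines of C. This is only a sketch under stated assumptions: nobody in this bug established what mono_handle_native_sigsegv really does after logging, so the handler below (swallow_sigsegv, a made-up name) simply logs "Stacktrace:" and then blocks forever. raise(SIGSEGV) stands in for the access violation, a polling loop stands in for Firefox's hang detector, and the timeout value is arbitrary. run_hang_model is likewise a name invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Assumed stand-in for a handler that swallows SIGSEGV: it logs a line,
 * then neither returns nor lets the process die. */
static void swallow_sigsegv(int signum)
{
    static const char msg[] = "Stacktrace:\n";
    ssize_t n = write(STDERR_FILENO, msg, sizeof msg - 1);
    (void)n;
    (void)signum;
    for (;;)
        pause(); /* alive, but no longer doing any work */
}

/* Forks a "plugin" child that hits a SIGSEGV, then watches it the way a
 * hang detector might. Returns 1 if the parent had to kill the child after
 * timeout_ms, 0 if the child died on its own, -1 on error. */
int run_hang_model(int install_handler, int timeout_ms)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (install_handler) {
            struct sigaction sa;
            sa.sa_handler = swallow_sigsegv;
            sigemptyset(&sa.sa_mask);
            sa.sa_flags = 0;
            sigaction(SIGSEGV, &sa, NULL);
        }
        raise(SIGSEGV); /* stand-in for the access violation */
        _exit(0);       /* not reached in either mode */
    }

    struct timespec tick = { 0, 10 * 1000 * 1000 }; /* 10 ms */
    int status;
    for (int waited = 0; waited < timeout_ms; waited += 10) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return 0; /* child terminated by itself */
        if (r < 0)
            return -1;
        nanosleep(&tick, NULL);
    }

    /* Timer expired: the child is presumed hung, so kill it. */
    kill(pid, SIGKILL);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) ? 1 : -1;
}
```

Calling run_hang_model(1, 200) models the behavior seen in this bug: the child survives the signal but is dead in practice, and only the parent's timeout ends it. Calling run_hang_model(0, 2000) models the case where no handler is installed and the child dies immediately.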
Crash Signature: [@ hang | libsystem_kernel.dylib@0x177ca ]
Whiteboard: [summary in comment #34]
Hi folks, this report just crossed my desk from our Support department. I've filed this into our internal system as case 574149 and marked it as a high priority for my team to fix.

Thanks for your diligence, I'll report back with our progress or questions.

Regards,
-- Ian Dundore
Webplayer Team Lead, Unity Development
See Also: → 946910
Hello,

 The core issue appears to be an incompatibility between the version of Mono in the 2.x Unity runtime and OS X 10.9 Mavericks, which causes the plugin to crash ungracefully. Given the complexity of fixing 2.x's Mono runtime, we've elected to block 2.x content on Mavericks, which will prevent this crash/hang from occurring.

 This fix is currently scheduled for release with Unity 4.5.

 Thanks for the report.

Regards,
-- Ian Dundore
Webplayer Team Lead, Unity Development
Thanks, Ian, for the fix and the information.

Could you give a rough estimate when Unity 4.5 will be released?
Hi Steven,

 Unfortunately, I really can't make a reliable estimate as Unity 4.5 is in alpha-testing right now. It will be early next year.

Regards,
-- Ian Dundore
Webplayer Team Lead, Unity Development
I can no longer repro this crash with UnityPlayer version 4.3.5f1 (on Nightly 31).
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Product: Core → Core Graveyard