Closed Bug 1117027 Opened 10 years ago Closed 10 years ago

Unity DT2 demo crashes on first click with e10s disabled on OS X

Categories

(Core :: DOM: UI Events & Focus Handling, defect)

x86_64
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla38
Tracking Status
e10s - ---
firefox35 --- unaffected
firefox36 --- unaffected
firefox37 + fixed
firefox38 + fixed

People

(Reporter: brendan, Assigned: smichaud)

References

(Blocks 1 open bug, )

Details

(Keywords: regression)

Attachments

(3 files, 1 obsolete file)

Four crash reports, last from very latest Nightly (December 31 tip), the other three from the one I had fired up to test tonight, which was built from a December 25 tip:

bp-e9d67f16-5170-484e-aba4-579e52150102	1/1/15	9:00 PM
bp-1f1e5a53-7ffd-4e83-aef0-2379a2150102	1/1/15	8:57 PM
bp-0fdee63f-4e57-4d30-beb8-caa4b2150102	1/1/15	8:47 PM
bp-8cdb5f26-f6ba-475a-a9f6-6da722150102	1/1/15	8:45 PM

In a (default) e10s window, DT2 is still unplayable, the trackpad is very hard to control (perhaps not as bad as last year, but unplayable).

Cc'ing a few folks. Recent-ish regression, I demo'ed DT2 successfully in London at NDC, early last month.

Event Handling as starting component but probably wrong. Please fix! Thanks.

/be
> In a (default) e10s window, DT2 is still unplayable, the trackpad is very hard to control (perhaps not as bad as last year, but unplayable).

This is me being unclear: the default e10s window loads DT2 and does not crash on first click. As noted, the non-e10s window is the only way to get a playable trackpad, and the click is required to start tracking (workaround for bug in older Unity build that won't be fixed in this hosted demo, which will be fixed for the DT2 release).

/be
Stack in those crash reports (0 or 1 frames) are perhaps suspicious; common 0x5f3ffff8 may be of interest.
er, common *crash address* 0x5f3ffff8 may be of interest.
I suspect all the crashes are stack overflows, despite EXC_BAD_ACCESS crash reason.  RSP is 0x00007fff5f400000 in all cases; seems too round to be a coincidence.  I don't see memory regions in the dump to verify, though.

That means it would be good to get a real stack.
Perhaps OS X specific, seems ok on linux.
Blocks: gecko-games
As dbaron surmised...

Who can take this bug?

/be
Looks like http://mxr.mozilla.org/mozilla-central/source/dom/events/EventStateManager.cpp?rev=e7b88b459b2a#3975 should deal with this kind of recursion.

I don't have a OSX machine to reproduce. Perhaps cpearce can reproduce.
Flags: needinfo?(cpearce)
(In reply to Olli Pettay [:smaug] from comment #7)
> Looks like
> http://mxr.mozilla.org/mozilla-central/source/dom/events/EventStateManager.
> cpp?rev=e7b88b459b2a#3975 should deal with this kind of recursion.

Thanks -- any idea what regressed this? Something that landed in the last month.

/be
Version: unspecified → Trunk
The regression is possibly something OSX specific, the stuff happening under
widget->SynthesizeNativeMouseMove (which calls SynthesizeNativeMouseEvent)

http://hg.mozilla.org/mozilla-central/annotate/57e4e9c33bef/widget/cocoa/nsChildView.mm#l1128
Can't repro on Windows here. Someone who has a clue about MacOSX would be best suited to fixing this.
Flags: needinfo?(cpearce)
Flags: needinfo?(smichaud)
Is Steven really the only Mac-enabled, likely-to-diagnose-this-bug developer around?

/be
Sorry to "Dear LazyWeb" -- I meant maybe someone could bisect. I'll do it when I have time, unless someone beats me to it. Thanks,

/be
I can reproduce this crash, but only on my main HiDPI screen.  If I open the game on my external monitor, it works just fine.
I'm on OSX 10.9.5.
I can also reproduce this with e10s disabled entirely, as you'd expect.  There's no additional useful console spam in a debug build, though I did hit a corrupted actor state abort in PLayer in the non-e10s window in e10s mode scenario.
Summary: Unity DT2 demo in non-e10s window crashes on first click → Unity DT2 demo in crashes on first click with e10s disabled
Andrew, could you give detailed STR?  (Unless just running the demo and clicking once on it is enough to crash.)

I'll also look for a regression range.

The most recent big Mac-specific plugin change is the patch for bug 1092630.
Flags: needinfo?(smichaud)
1. Load page, then wait for it to finish downloading.
2. Click on either "play" button.
3. Wait until it goes full-screen.  Sometimes I clicked on the "ok" for full screen permission, but I think sometimes not.
4. Click using the trackpad for an immediate crash.

It is easy to trigger.  The only caveat is, as I said, that I couldn't reproduce on an external monitor.  I was able to reproduce on the main HiDPI screen, with or without an external monitor plugged in.
(In reply to comment #17)

Yup, that does it for me (I crash, with today's m-c nightly in non-e10s mode) on my Retina MacBook Pro (with no external monitor attached) with all of OS X 10.7.5, 10.9.5 and 10.10.1.  (The only reason I didn't test OS X 10.8.5 is that I don't have it installed there.)

I'll get and post one or more all-thread stack gdb/lldb stack traces.

It's very likely the trigger is bug 1092630 ... but we don't yet know for sure.
One extremely puzzling thing is that I don't see a Firefox Unity plugin process while the demo is running.  Do you know what's up with that, Brendan?
Flags: needinfo?(brendan)
And the demo works with the Unity plugin disabled.  So apparently it doesn't use the Unity plugin at all.
Yes, I believe this is a demo of a Unity-engine game compiled to ASM.js.  It may use Flash or something for sounds, I'm not sure what the state of that is, but otherwise no plugins.
I don't need to post a gdb/lldb stack trace, since Brendan has already done so in comment #6.

It looks like EventStateManager::GenerateMouseEnterExit() is re-entered in an infinite loop until we run out of stack space.  So, though it may be easier to trigger this bug on OS X, it's probably ultimately a bug in cross-platform event handling code.
Thanks for the info, Andrew.

> It may use Flash or something for sounds

It doesn't.  I don't see any Firefox plugin processes at all.
Flags: needinfo?(brendan)
(In reply to Steven Michaud from comment #23)
> Thanks for the info, Andrew.
> 
> > It may use Flash or something for sounds
> 
> It doesn't.  I don't see any Firefox plugin processes at all.

Correct, it uses Web Audio for sound, and the engine itself is asm.js. No plugins here.
Here's the regression range:

firefox-2014-12-11-03-02-06-mozilla-central
firefox-2014-12-12-03-02-01-mozilla-central

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=0cf461e62ce5&tochange=5288b15d22de
> It's very likely the trigger is bug 1092630

Bizarrely I turned out to be right -- even though we now know this bug has nothing to do with plugins.

I don't think it can be the cause, though.  As I said above, that's probably a cross-platform bug.

Since this bug doesn't (probably) require any Mac-specific wizardry to fix, is it possible that someone else can take it? :-)
Depends on: 1092630
http://hg.mozilla.org/mozilla-central/rev/a6db8f54f5aa

 Bug 1092630: Get rid of native widgets for OS X NPAPI plugins, make things work much better under e10s. Patch by Josh Aas, Markus Stange, Steven Michaud, David Parks. r=smichaud/jst/josh (more reviews pending)
author	Josh Aas <joshmoz@gmail.com>
	Thu Dec 11 08:44:07 2014 -0600 (at Thu Dec 11 08:44:07 2014 -0600)
That patch did change the coordinate handling in SynthesizeNativeMouseEvent
[Tracking Requested - why for this release]: Breaks an asm.js game demo
(As an aside, I'd guess the demo not working with e10s is bug 1044228, because with the test case in that bug I get the same weird behavior where the position skips all over the place as I get in this demo.)
> That patch did change the coordinate handling in SynthesizeNativeMouseEvent

So it did, here:

https://hg.mozilla.org/mozilla-central/annotate/912036eeb024/widget/cocoa/nsChildView.mm#l1134

And I now realize the comment is wrong.

I'll post a patch reversing this part of the patch for bug 1092630, and ask people to test it.
Summary: Unity DT2 demo in crashes on first click with e10s disabled → Unity DT2 demo in crashes on first click with e10s disabled on OS X
> I'll post a patch reversing this part of the patch for bug 1092630, and ask people to test it.

It doesn't work locally (it causes a different problem), so I won't do this.

But in any case it's clear that at least some parts of Josh's patch for bug 1092630 make wrong assumptions.  And I've covered a lot of this ground before myself.  So (sigh) I'm probably the best one to take this bug.
Assignee: nobody → smichaud
Status: NEW → ASSIGNED
Summary: Unity DT2 demo in crashes on first click with e10s disabled on OS X → Unity DT2 demo crashes on first click with e10s disabled on OS X
I've got a patch that seems to fix both this bug and bug 1112267.  Before I post it I want to see the results of a tryserver run.  But our build infrastructure seems to have crapped out for the time being, so that will probably have to wait for tomorrow.

In fact my patch is just what I described in comment #31.  But I wasted many hours before discovering that this bug's testcase won't run in self builds.  It only runs in "official" releases, nightlies and tryserver builds.  In other builds the testcase's asm.js code seems to go into an infinite loop, and you eventually get the "unresponsive script" warning.  I spent more time trying to figure out why -- to figure out what build options I'd need to add to my mozconfig files (or take away from them).  I still haven't managed to do this.

Brendan and Alon (or anybody else), do you have any idea why the demo doesn't work in my self builds?
Flags: needinfo?(brendan)
Flags: needinfo?(azakai)
I'm able to run the test case in the URL in my own regular optimized debug build on OSX, with clang "Apple LLVM version 6.0 (clang-600.0.56) (based on LLVM 3.5svn)".  It was a bit pokey starting up, but not more than ten or so seconds, and I didn't see a slow script dialogue.
Andrew, could you attach your mozconfig?

I hadn't thought to change what version of clang I use.  I'll do that (by building on a different version of OS X) and see if it makes a difference.  I've been building on OS X 10.7.5, with "Apple LLVM version 4.2 (clang-425.0.28) (based on LLVM 3.2svn)".
Attached file mozconfig β€”
Yup, it seems to be the version of clang you use that makes the difference (or possibly the version of OS X you build on):  A self build done on OS X 10.8.5, using "Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)", works just fine.

Weird.
Flags: needinfo?(brendan)
Flags: needinfo?(azakai)
I'd like to see the potential patch, no need to wait to verify that it's correct. Can you post it?
Attached patch Fix (obsolete) β€” β€” Splinter Review
Here's my patch, Josh.  I just reversed your change to nsChildView::SynthesizeNativeMouseEvent().

Your comment is wrong.  Cocoa "points" (which elsewhere get called "display pixels") are *not* (in general) the same as Gecko "device pixels".  "Device pixels" are actual pixels, which are not (in the general case) directly addressable.  "Display pixels" *are* directly addressable, using methods that take parameters like NSPoint and NSRect.

In non-HiDPI mode "device pixels" and "display pixels" are physically the same.  But not in HiDPI mode, where a "display pixel" equals four "device pixels".

Yes, it's horribly confusing.  But then so is HiDPI mode itself.
Attachment #8546757 - Flags: review?(joshmoz)
For the record, here's the results of my tryserver run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f0d7659bc48e

There were no non-spurious test failures.

And here's the opt tryserver build:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-f0d7659bc48e/try-macosx64/firefox-37.0a1.en-US.mac.dmg
Attachment #8546757 - Flags: review?(joshmoz) → review?(mstange)
Comment on attachment 8546757 [details] [diff] [review]
Fix

Review of attachment 8546757 [details] [diff] [review]:
-----------------------------------------------------------------

Looks like David Parks introduced this change in bug 1092630 comment 12 in order to fix test failures. Do tests still pass with this patch?

Do you know how this bug caused the crash?
Attachment #8546757 - Flags: review?(mstange) → review+
Oh, you already ran tests and they were green. Hmm. And our test machines don't run on HiDPI machines, so David wouldn't have seen the test failure on TBPL to begin with.

David, did you run the tests on a HiDPI machine when you made the change to nsChildView::SynthesizeNativeMouseEvent? Do you know which test failure you were fixing?
Flags: needinfo?(davidp99)
(In reply to Markus Stange [:mstange] from comment #42)

> David, did you run the tests on a HiDPI machine when you made the change to
> nsChildView::SynthesizeNativeMouseEvent? Do you know which test failure you
> were fixing?

Both David and I were running tests locally and trying to fix bugs with HiDPI. We should definitely make sure this doesn't regress local HiDPI tests.
(In reply to Markus Stange [:mstange] from comment #42)
> David, did you run the tests on a HiDPI machine when you made the change to
> nsChildView::SynthesizeNativeMouseEvent? Do you know which test failure you
> were fixing?

Yes, I'm on a retina MBP -- this work was partially about fixing existing test failures on HiDPI devices... just because thats the kind of machine I use and the tests were the best way to confirm the rest of the plugin work.  Its entirely possible that I made the wrong call about the function's expected coordinate system.

I would have changed it to fix, on HiDPI mac, one of three tests:

test_plugin_mouse_coords.html
test_cocoa_focus.html
test_cocoa_window_focus.html

It was likely the first one...  If reverting the change maintains green, then go for it.
Flags: needinfo?(davidp99)
David's change to nsChildView::SynthesizeNativeMouseEvent() is clearly wrong, and my reversion of it is clearly correct.  But other changes may also be required, including to some of the tests.  I'll look into this over the next few days, and only then land my patch.
(In reply to comment #41)

> Do you know how this bug caused the crash?

No, not in detail.  David's change to nsChildView::SynthesizeNativeMouseEvent() is so clearly wrong that I think fixing it is enough (as long as it also fixes this bug, which it does).  But, like I said above, some tests may also need to be fixed.

I may very well be right that the stack exhaustion is also a bug in cross-platform code.  But I'll leave that for someone else to fix, if/when it comes up again.
(In reply to comment #44)

> test_plugin_mouse_coords.html

I get the following failures in this test when I run it locally (on my Retina MacBook Pro) in current trunk code (i.e. without my patch for this bug):

36 INFO TEST-UNEXPECTED-FAIL | layout/generic/test/test_plugin_mouse_coords.html | p1 mouse move Y - got 18, expected 38
37 INFO TEST-UNEXPECTED-FAIL | layout/generic/test/test_plugin_mouse_coords.html | p2 mouse move Y - got 18, expected 38
38 INFO TEST-UNEXPECTED-FAIL | layout/generic/test/test_plugin_mouse_coords.html | p3 mouse move Y - got 11, expected 31

David, do you also see these failures when you run this test locally (on a Retina MacBook Pro)?  On which version of OS X do make the build and run the test?  What kind of build is it, generally?

I built a non-opt non-debug build on OS X 10.7.5, and also tested there.
Flags: needinfo?(davidp99)
(Following up comment #47)

Interestingly, these failures only happen (in current trunk code) with a 64-bit build -- not with a 32-bit build.
(In reply to Steven Michaud from comment #47)
> (In reply to comment #44)
> 
> David, do you also see these failures when you run this test locally (on a
> Retina MacBook Pro)?  On which version of OS X do make the build and run the
> test?  What kind of build is it, generally?
> 
> I built a non-opt non-debug build on OS X 10.7.5, and also tested there.

The test passes fine for me (using a plain trunk build - i.e. not including the patch in this bug) on an OSX 10.9 retina MBP.  about:buildconfig says x86_64.
Flags: needinfo?(davidp99)
PS: It was a debug build:
> ac_add_options --enable-debug
Attached patch Fix plus test change β€” β€” Splinter Review
I found out which test was failing for you, David, and how to fix it:

> test_cocoa_focus.html

The test as it stood didn't properly support HiDPI mode.  Now (with my changes) it does.

This test failed in local builds done on both OS X 10.7.5 and 10.9.5, with "--enable-debug" on or off.  With my changes it always succeeds (using my local builds).
Attachment #8546757 - Attachment is obsolete: true
Attachment #8552032 - Flags: review?(mstange)
Comment on attachment 8552032 [details] [diff] [review]
Fix plus test change

Review of attachment 8552032 [details] [diff] [review]:
-----------------------------------------------------------------

Perfect!
Attachment #8552032 - Flags: review?(mstange) → review+
Comment on attachment 8552032 [details] [diff] [review]
Fix plus test change

Landed on mozilla-inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/953db4d13beb
https://hg.mozilla.org/mozilla-central/rev/953db4d13beb
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
Comment on attachment 8552032 [details] [diff] [review]
Fix plus test change

Approval Request Comment
[Feature/regressing bug #]: 1092630
[User impact if declined]: Pointer locking broken, among other things
[Describe test coverage new/current, TreeHerder]: Manual testing and brief baking on m-c
[Risks and why]: Low risk -- fixes obvious mistakes in our code
[String/UUID change made/needed]: None
Attachment #8552032 - Flags: approval-mozilla-aurora?
Attachment #8552032 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8552032 [details] [diff] [review]
Fix plus test change

Landed on mozilla-aurora:
https://hg.mozilla.org/releases/mozilla-aurora/rev/8a791f4daca
Component: Event Handling → User events and focus handling
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: