Closed Bug 776132 Opened 8 years ago Closed 8 years ago

UI Test app - window.close test causes seg fault when run OOP (Otoro, debug-build only)

Categories

(Firefox OS Graveyard :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(blocking-basecamp:-)

RESOLVED FIXED
blocking-basecamp -

People

(Reporter: dhylands, Assigned: justin.lebar+bug)

References

(Blocks 1 open bug)

Details

(Whiteboard: [LOE:S])

Attachments

(2 files, 1 obsolete file)

UI Test App, window.close test causes a seg fault in the plugin-container process.

Backtrace is attached.
All core apps need to run OOP
blocking-basecamp: --- → ?
blocking-basecamp: ? → +
Depends on: 782971
I cannot reproduce this.  Can you, Dave?

(Note that although only the header of the UI test app is drawn, you can still click the buttons to open up the window.open test page; you just have to search around a bit for the right button, since it's invisible.)
I'm still seeing a crash (on my otoro), but it's slightly different now.

STR:
1 - Make sure UI Test isn't blacklisted. Edit gaia/apps/system/js/window_manager.js around line 463 and comment out the 'UI test' line. Run make install-gaia from inside the gaia directory.

2 - Run UI Tests
3 - Verify that UI Tests was actually run OOP, by observing the line:

Content JS INFO at app://system.b2g.hylands.org/js/window_manager.js:481 in appendFrame: %%%%% Launching UI tests as remote (OOP)

in logcat.

4 - Click on window.close button. (I see all of the buttons) I now see the following:

[Child 358] ###!!! ABORT: __delete__()d actor: file /home/work/B2G-otoro/objdir-gecko-debug-git/ipc/ipdl/PHal.cpp, line 28
Assignee: nobody → justin.lebar+bug
Jason, since I can't reproduce this bug, I don't feel comfortable having it assigned to me at the moment.

I know we want to make sure that all blockers are owned, but if I don't consent to having a bug assigned to me, I may just ignore it, and then that makes it really difficult to keep track of which bugs I've actually assigned to myself, and which are assigned to me just for show.
Assignee: justin.lebar+bug → nobody
I just tested on my Otoro with the str in comment 3, now that I have a working build on my device.  I see

> E/GeckoConsole(  116): Content JS INFO at app://system.gaiamobile.org/js/window_manager.js:503 in 
> appendFrame: %%%%% Launching UI tests as remote (OOP)

Then I click the window.close button.  The window closes, and I do not see "ABORT" anywhere in my log.

I'm at Gecko rev 35819088b27, from today, Friday, Aug 17.

If you see all of the buttons in the UI test app, perhaps you're running an old version of Gecko.  It's also interesting that my log shows "app://system.gaiamobile.org", while yours shows "app://system.b2g.hylands.org" -- perhaps that reflects a difference in our setups, or in our builds.

It could be that this only appears with ICS otoros -- mine is GB.  But I seriously doubt that's what's going on here...
It at the same git revision.

I'm seeing all of the buttons and the crash.

I'm running ICS based otoro.

The b2g.hylands.org is because I set GAIA_DOMAIN to b2g.hylands.org (some non-existant URL)
And I'm also doing a debug build (i.e. B2G_DEBUG=1).

So the only customizations I have in my build are B2G_DEBUG=1 and GAIA_DOMAIN=b2g.hylands.org
It's likely the debug build that's causing it.
I'll do a non-debug build and retest and see what happens
In a non-DEBUG build, it doesn't crash for me, and I see no messages to logcat when doing the window.close().

In a DEBUG build, I see th ABORT/seg-fault.
Summary: UI Test app - window.close test causes seg fault when run OOP → UI Test app - window.close test causes seg fault when run OOP (Otoro, debug-build only)
I'll see what I can do about fixing this, but I don't think debug-only issues should block.
blocking-basecamp: + → ?
Assignee: nobody → justin.lebar+bug
Well, it only detects the problem in debug because the detection code doesn't run in release. The problem is almost certainly still present in the release code, and just not manifesting itself anywhere yet.
Took a look at the backtrace ... looks eerily familiar.
Yes, in fact: bug 774606 comment 1, and then followup bug 774606.  Looked message-manager related.
I see a slightly different sequence of events leading up to the segfault:

E/GeckoConsole(  209): [JavaScript Error: "NS_ERROR_INVALID_POINTER: Component returned failure code: 0x80004003 (NS_ERROR_INVALID_POINTER) [nsIFrameMessageManager.removeMessageListener]" {file: "chrome://browser/content/forms.js" line: 155}]
E/GeckoConsole(  209): [JavaScript Error: "NS_ERROR_INVALID_POINTER: Component returned failure code: 0x80004003 (NS_ERROR_INVALID_POINTER) [nsIFrameMessageManager.removeMessageListener]" {file: "chrome://browser/content/forms.js" line: 155}]
I/Gecko   (  209): [Child 209] WARNING: Attempt to get screen rotation through nsIScreen::GetRotation().  Nothing should know or care this in sandboxed contexts.  If you want *orientation*, use hal.: file ../../../gecko/widget/xpwidgets/PuppetWidget.cpp, line 623
I/Gecko   (  209): [Child 209] ###!!! ABORT: not on worker thread!: 'mWorkerLoop == MessageLoop::current()', file ../../dist/include/mozilla/ipc/AsyncChannel.h, line 207
F/libc    (  209): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1)
That's a different error.  Looks like we're trying to send IPC after ActorDestroy(), from somewhere.  The messageManager stuff is the same though.
Finally got a stack for the crash I'm seeing (which, as we say, may or may not be related to dhylands's original stack)

> mozilla::ipc::AsyncChannel::AssertWorkerThread() const (/home/jlebar/code/moz/B2G/objdir-gecko-debug/ipc/glue/../../dist/include/mozilla/ipc/AsyncChannel.h:208)
> CxxStackFrame (/home/jlebar/code/moz/B2G/objdir-gecko-debug/ipc/glue/../../dist/include/mozilla/ipc/RPCChannel.h:249)
> mozilla::ipc::RPCChannel::Send(IPC::Message*) (/home/jlebar/code/moz/B2G/objdir-gecko-debug/ipc/glue/../../../gecko/ipc/glue/RPCChannel.cpp:110)
> mozilla::hal_sandbox::PHalChild::SendEnableSensorNotifications(mozilla::hal::SensorType const&) (/home/jlebar/code/moz/B2G/objdir-gecko-debug/ipc/ipdl/PHalChild.cpp:1024)
> mozilla::hal_sandbox::EnableSensorNotifications(mozilla::hal::SensorType) (/home/jlebar/code/moz/B2G/objdir-gecko-debug/hal/../../gecko/hal/sandbox/SandboxHal.cpp:213)
> mozilla::hal::EnableSensorNotifications(mozilla::hal::SensorType) (/home/jlebar/code/moz/B2G/objdir-gecko-debug/hal/../../gecko/hal/Hal.cpp:438)
> mozilla::hal::RegisterSensorObserver(mozilla::hal::SensorType, mozilla::Observer<mozilla::hal::SensorData>*) (/home/jlebar/code/moz/B2G/objdir-gecko-debug/hal/../../gecko/hal/Hal.cpp:469)
> OrientationObserver::EnableAutoOrientation() (/home/jlebar/code/moz/B2G/objdir-gecko-debug/widget/gonk/../../../gecko/widget/gonk/OrientationObserver.cpp:256)
> OrientationObserver (/home/jlebar/code/moz/B2G/objdir-gecko-debug/widget/gonk/../../../gecko/widget/gonk/OrientationObserver.cpp:177)
> OrientationObserver::GetInstance() (/home/jlebar/code/moz/B2G/objdir-gecko-debug/widget/gonk/../../../gecko/widget/gonk/OrientationObserver.cpp:162)
> nsAppShell::Exit() (/home/jlebar/code/moz/B2G/objdir-gecko-debug/widget/gonk/../../../gecko/widget/gonk/nsAppShell.cpp:532)
> nsBaseAppShell::Observe(nsISupports*, char const*, unsigned short const*) (/home/jlebar/code/moz/B2G/objdir-gecko-debug/widget/xpwidgets/../../../gecko/widget/xpwidgets/nsBaseAppShell.cpp:419)
> nsVoidArray::Count() const (/home/jlebar/code/moz/B2G/objdir-gecko-debug/xpcom/ds/../../dist/include/nsVoidArray.h:37)
> nsObserverService::NotifyObservers(nsISupports*, char const*, unsigned short const*) (/home/jlebar/code/moz/B2G/objdir-gecko-debug/xpcom/ds/../../../gecko/xpcom/ds/nsObserverService.cpp:152)
> mozilla::ShutdownXPCOM(nsIServiceManager*) (/home/jlebar/code/moz/B2G/objdir-gecko-debug/xpcom/build/../../../gecko/xpcom/build/nsXPComInit.cpp:582)
> NS_ShutdownXPCOM_P (/home/jlebar/code/moz/B2G/objdir-gecko-debug/xpcom/build/../../../gecko/xpcom/build/nsXPComInit.cpp:542)
> XRE_TermEmbedding (/home/jlebar/code/moz/B2G/objdir-gecko-debug/toolkit/xre/../../../gecko/toolkit/xre/nsEmbedFunctions.cpp:197)
> mozilla::ipc::ScopedXREEmbed::Stop() (/home/jlebar/code/moz/B2G/objdir-gecko-debug/ipc/glue/../../../gecko/ipc/glue/ScopedXREEmbed.cpp:95)
> mozilla::dom::ContentProcess::CleanUp() (/home/jlebar/code/moz/B2G/objdir-gecko-debug/dom/ipc/../../../gecko/dom/ipc/ContentProcess.cpp:32)
> XRE_InitChildProcess (/home/jlebar/code/moz/B2G/objdir-gecko-debug/toolkit/xre/../../../gecko/toolkit/xre/nsEmbedFunctions.cpp:490)
Heh, so this code

  NS_IMETHODIMP
  nsAppShell::Exit()
  {
    OrientationObserver::GetInstance()->DisableAutoOrientation();

ends up, as you can see, calling /Enable/AutoOrientation.  Go us.
It looks like GetInstance() constructs a new OrientationObserver which is what's calling EnableAutoOrientation
Attached patch Patch, v1 (obsolete) — Splinter Review
The code to disable the orientation observer isn't needed in the appshell -- the orientation observer is already cleared on shutdown, and that disables the observer.
(In reply to Dave Hylands [:dhylands] from comment #20)
> It looks like GetInstance() constructs a new OrientationObserver which is
> what's calling EnableAutoOrientation

Right.  Normally it would return an existing singleton instead of creating a new instance, but at this point, the ClearOnShutdown code has already run, clearing our old singleton.  So we make a new one and crash.
Comment on attachment 644491 [details]
Backtrace from UI Test window.close seg fault

This stacktrace is obsolete -- dhylands and I now both see the the one from comment 18.
Blocks: 740719
Comment on attachment 652983 [details] [diff] [review]
Patch, v1

How do I ensure that bug 740719 is not regressed with this patch?
Attachment #652983 - Flags: review?(jones.chris.g)
I don't understand why bug 740719 added this orientation observer change.  Does the orientation observer need to be shut down /before/ the appshell, or something?  We can ensure that if necessary, but not with the existing code!
The reason this doesn't happen in opt builds is that we _exit(0) instead of xpcom shutdown.

But, ARGGGGHHH!  This shouldn't be running at all in content processes!  Fucking appshell.  I stepped on this very same class of bug in bug 776835, except with android.

The best way to fix this is to make a content-process AppShell to go along with our content-process Widget.  That shouldn't be too bad, but I'm not sure how your widget-fu is.

If you make this hack content-process-only, you will be guaranteed not to regress the other bug.
> If you make this hack content-process-only, you will be guaranteed not to regress the other bug.

It's not clear to me that this is right, though, even in the master process.

I'll rewrite the patch to add a "shut me down" method to the orientation observer that is guaranteed not to /create/ the orientation observer.
Attached patch Patch, v2Splinter Review
Attachment #652983 - Attachment is obsolete: true
Attachment #652983 - Flags: review?(jones.chris.g)
Attachment #652994 - Flags: review?(jones.chris.g)
(In reply to Justin Lebar [:jlebar] from comment #27)
> > If you make this hack content-process-only, you will be guaranteed not to regress the other bug.
> 
> It's not clear to me that this is right, though, even in the master process.
> 

Oh --- wrt comment 25?  The problem was that as long as there were sensor listeners alive, we kept the sensor device active.  This call ensured that the widget listener was cleared.

> I'll rewrite the patch to add a "shut me down" method to the orientation
> observer that is guaranteed not to /create/ the orientation observer.

I don't completely understand this, but we shouldn't even attempt to run any of this auto-orientation code in content processes or we're going to screw lots of things up.
Today I was talking to Marshall Law and he was looking into a problem where the orientation sensor was getting disable one too many times. This may very well be related.
> The problem was that as long as there were sensor listeners alive, we kept the sensor device 
> active.  This call ensured that the widget listener was cleared.

Yeah, but we call ClearOnShutdown on the OrientationObserver, and that call should cause the destructor to run and cause DisableAutoOrientation to run as well.

However, it looks like KillClearOnShudown won't run until all threads other than the main thread have shut down.  If the orientation observer is keeping a thread alive, then that would explain the hang.
Attachment #652994 - Flags: review?(jones.chris.g) → review+
Whiteboard: [LOE:S]
https://hg.mozilla.org/mozilla-central/rev/a6ab6aaa8aa0
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
blocking-basecamp: ? → -
You need to log in before you can comment on or make changes to this bug.