Closed Bug 1384336 Opened 2 years ago Closed 2 years ago

Avoid running the appshell in content processes

Categories

(Core :: DOM: Content Processes, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: billm, Assigned: billm)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files, 3 obsolete files)

Having to run the appshell in content processes makes Quantum DOM a lot harder. It's also mostly useless, and it probably is a small drag on our performance. And it makes sandboxing harder (see bug 1381022).

I'm trying to make this bug pretty tightly scoped: rather than remove the appshell entirely from the content process, I just want to avoid calling the nsIThreadObserver callbacks for it. That will allow some of the widget-specific code that gets initialized in appshell to continue working. But it means we won't be using OS-specific event waiting mechanisms, which is what I really want to avoid.
How does this seem to you, Olli?
Attachment #8890124 - Flags: review?(bugs)
Attached patch accessibility changes (obsolete) — Splinter Review
Our accessibility code wants relies on us being in an "alertable state" (as defined by Win32) while processing events since that allows us to process APCs. I tried getting rid of the APC mechanism entirely and just using events, but that caused deadlocks if the main thread is blocked in EnsureMTA. So this patch does both, as we discussed at the workweek, Aaron.

If we decide to go this route, I think there's a bit more cleanup that could be done. However, I want to get some feedback before doing that.

These patches are green on Linux and Windows. I haven't tried Mac yet.
Attachment #8890125 - Flags: feedback?(aklotz)
Comment on attachment 8890124 [details] [diff] [review]
don't register thread observer in child process

Is this ok for all the non-parent processes?

I think someone familiar with our plugin processes and gpu processes should take a look too.

But if we can do this, great!
Attachment #8890124 - Flags: review?(bugs) → review+
Comment on attachment 8890125 [details] [diff] [review]
accessibility changes

Review of attachment 8890125 [details] [diff] [review]:
-----------------------------------------------------------------

This looks great, Bill!

FYI, with respect to additional cleanup work and a11y, we can do some stuff in MessageChannel::WaitForSyncNotifyWithA11yReentry once the main event loop changes are done.
Attachment #8890125 - Flags: feedback?(aklotz) → feedback+
(In reply to Olli Pettay [:smaug] (vacation-ish until July 30) from comment #3)
> Is this ok for all the non-parent processes?
> 
> I think someone familiar with our plugin processes and gpu processes should
> take a look too.

I can't comment on gpu or GMP processes, but we probably can't in the NPAPI plugin container. I suspect that Flash assumes a fully-fledged Windows message pump in there.
(Unless we're already using something different down there... I can't remember)
Blocks: 1385014
(In reply to Aaron Klotz [:aklotz] (a11y work receiving priority right now, please send interceptor reviews to dmajor or handyman) from comment #4)
> Comment on attachment 8890125 [details] [diff] [review]
> accessibility changes
> FYI, with respect to additional cleanup work and a11y, we can do some stuff
> in MessageChannel::WaitForSyncNotifyWithA11yReentry once the main event loop
> changes are done.

I filed bug 1385014 for this.
OK, here's a cleaner patch.
Attachment #8890125 - Attachment is obsolete: true
Attachment #8891509 - Flags: review?(aklotz)
Attachment #8891509 - Flags: review?(aklotz) → review+
Ugh, this is totally orange on Mac.
I don't know the appshell code very well, but please let me know if there's anyway I can help out with the mac bits.
This patch is similar to the previous one: it disables the appshell's event loop code in the content process. However, I made it pref-controlled. I also needed some extra changes for Mac, since the MacOS appshell code is a little different.

Stephen, could you review this? The main question here is whether there's any need to run a native event loop in the content process on Macs.
Attachment #8890124 - Attachment is obsolete: true
Attachment #8893563 - Flags: review?(spohl.mozilla.bugs)
Comment on attachment 8893563 [details] [diff] [review]
don't register thread observer in child process, v2

(In reply to Bill McCloskey (:billm) from comment #11)
> The main question here is whether there's
> any need to run a native event loop in the content process on Macs.

Forwarding to mstange to make sure.
Attachment #8893563 - Flags: review?(spohl.mozilla.bugs) → review?(mstange)
Things that would need the native event loop in the content process are:
 (1) Callers that use [someObjCObject performSelector:@selector(someSelector) afterDelay:someDelay].
 (2) Notification observers, e.g. things that listen for changes to system settings.

I don't see either of those things used in any code that runs in the content process. That's great! I was expecting more problems here.

Looking through nsAppShell.mm, I also noticed two other things: HangReporter integration, and MacWakeLockListener. We need to make sure that this patch doesn't cause hang reporting to detect the content process as hanging all the time. And the fact that we currently set up MacWakeLockListeners in the content process is probably just a bug.

I also checked what kinds of events go through -[GeckoNSApplication sendEvent:] in the content process at the moment. I've only seen two event types: NSApplicationDefined events, which we use internally to wake up the event loop, and NSSystemDefined events which seem to be firing system-wide when certain things happen. For some of them I wasn't able to determine a cause. Some seem to fire whenever you use the dictionary popup anywhere on the system.

I'm excited about this change and I think it's time that we made it. Sometimes I see overhead from the native event loop in profiles, and somethings things related to -[NSApplication setWindowsNeedUpdate:] which I don't understand, and this change should reduce that overhead.

I've briefly tested this patch now, and the only problem I've found is that Activity Monitor now says our child processes are "Not Responding". We should fix that. There's a Chrome bug about a similar problem with some useful information, it gets interesting around here: https://bugs.chromium.org/p/chromium/issues/detail?id=304860#c21
(In reply to Markus Stange [:mstange] from comment #13)
> Looking through nsAppShell.mm, I also noticed two other things: HangReporter
> integration, and MacWakeLockListener. We need to make sure that this patch
> doesn't cause hang reporting to detect the content process as hanging all
> the time. And the fact that we currently set up MacWakeLockListeners in the
> content process is probably just a bug.

Looking at the patch again, I can see that you're not changing anything about the MacWakeLockListener situation, so let's not worry about that.
In bug 1386308 I removed the use of MacWakeLockListener in content.
I'm afraid I didn't understand that Chrome bug. Is there something we could do to fix it?
(In reply to Alex Gaynor [:Alex_Gaynor] from comment #15)
> In bug 1386308 I removed the use of MacWakeLockListener in content.

Oh, good.

(In reply to Bill McCloskey (:billm) from comment #16)
> I'm afraid I didn't understand that Chrome bug. Is there something we could
> do to fix it?

I think we need to find out where we trigger a call to _RegisterApplication and avoid making that call. The chromium issue points out process renaming as one source of these calls. The general idea is: if the WindowServer has never heard from our process, it won't care about whether we're responding. So we need to make sure that we never communicate with it in the first place.
Not sure if it'd solve this problem, but bug 1322024 deals with blocking new connections to the windoserver at the sandbox level -- as long as we don't obtain a connection to the windowserver before the sandbox is enabled, then it'd become impossible for us to ever talk to the windowserver. (If you find any places where we're talking to the windowserver, I guess they're blockers for that bug!)
The patch in bug 1322024 doesn't fix the "not responding" problem. It seems like the window server connection is probably created before the sandbox is initialized. I set a breakpoint in _RegisterApplication and saw it being called from SetProcessName [1]. When I commented out that call, I saw it in appshell initialization [2]. I was afraid to comment out that thing since I don't understand what it does.

However, I did find a Chromium patch [3] that shuts down the connection to the window server and prevents all future connections. I tried that and it seemed to work. I've got a try push pending to see how that goes [4]. I'm assuming that would subsume the change to the sandboxing rules, but I don't really know. None of this stuff seems to be documented.

[1] http://searchfox.org/mozilla-central/rev/765cc1b925c5d32d05111c364257a0b79bf2952e/dom/plugins/ipc/PluginUtilsOSX.mm#276
[2] http://searchfox.org/mozilla-central/rev/765cc1b925c5d32d05111c364257a0b79bf2952e/widget/cocoa/nsAppShell.mm#287
[3] https://codereview.chromium.org/673443002/patch/20001/30001
[4] https://treeherder.mozilla.org/#/jobs?repo=try&revision=0b4356599ff4d04d3233fad5ca7701913963173c
(In reply to Bill McCloskey (:billm) from comment #19)
> However, I did find a Chromium patch [3] that shuts down the connection to
> the window server and prevents all future connections. I tried that and it
> seemed to work. I've got a try push pending to see how that goes [4]. I'm
> assuming that would subsume the change to the sandboxing rules, but I don't
> really know. None of this stuff seems to be documented.

We'd still want to remove the allowance for windowserver.active from the sandbox rules because that blocks access at the kernel level. I don't know exactly how CGSSetDenyWindowServerConnections works, but from the disassembly it appears to set a process-local flag which could be easily subverted.

It's a shame to have to use private API's, but it seems like a reasonable thing to do in this situation.

I'm not sure it's worth it, but if we wanted to avoid the private API's, we could check if _RegisterApplication still works properly after the sandbox has initialized. If it does, we might be able to refactor our ContentChild setup code so that _RegisterApplication is called after we've blocked connections to the window server in the sandbox and then we wouldn't need to use the private API calls. Chromium intentionally uses the windowserver in their "sandbox warmup code" before enabling the sandbox. Since this is all undocumented behavior, I'm not sure we'd be gaining much though. And explicitly closing the windowserver.active is a good thing to do in case it is ever opened as a side-effect of a different Apple API call / in a new OS version.

--

I think that not using a native event loop will resolve bug 1306663. That's an example of messages coming in on the native event loop trying to register an IPC service in plugin-container. (i.e. an IPC server in plugin-container). We disable registering IPC services in the sandbox rules, but we still get the event loop messages which attempt to do it. And they trigger error messages on the command line console and in the Console app.
What do you think, Markus? I think our alternatives are:
1. Use CGSShutdownServerConnections when we shut down the sandbox.
2. Avoid setting the process name and also figure out how to fix up nsAppShell::Init so it doesn't use the window server. And maybe fix up anything else that contacts the window server after that.
Flags: needinfo?(mstange)
(In reply to Bill McCloskey (:billm) from comment #19)
> I set a breakpoint in _RegisterApplication and saw it being
> called from SetProcessName [1].

Ok, this one matches what the Chrome people encountered in their bug.

> When I commented out that call, I saw it in
> appshell initialization [2]. I was afraid to comment out that thing since I
> don't understand what it does.

This initializes the process-wide NSApplication instance. NSApplication manages the event loop and any open windows. Since we don't need either of those in the content process, ideally we wouldn't ever need to create NSApplication.
I'd prefer not using anything of the Mac nsAppShell in the content process, instead of initializing some of it and then never calling back into its code. But I don't know how much work that would be.

> However, I did find a Chromium patch [3] that shuts down the connection to
> the window server and prevents all future connections. I tried that and it
> seemed to work.

Interesting. So maybe CGSShutdownServerConnections() effectively tells the window server to stop monitoring this process for responsiveness.

(In reply to Bill McCloskey (:billm) from comment #21)
> What do you think, Markus? I think our alternatives are:
> 1. Use CGSShutdownServerConnections when we shut down the sandbox.
> 2. Avoid setting the process name and also figure out how to fix up
> nsAppShell::Init so it doesn't use the window server. And maybe fix up
> anything else that contacts the window server after that.

2 certainly seems like the more prudent thing to do, but 1 is so much simpler. And since Chrome is using that API, they must have determined that it's necessary. So if we have to use it anyway, we might as well use it to save us some work. Let's go with 1.
Flags: needinfo?(mstange)
Attachment #8893563 - Attachment is obsolete: true
Attachment #8893563 - Flags: review?(mstange)
Attachment #8895537 - Flags: review?(mstange)
Comment on attachment 8895537 [details] [diff] [review]
don't register thread observer in child process, v3

Review of attachment 8895537 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ContentChild.cpp
@@ +1480,5 @@
> +  // future connections as well. We do this for sandboxing, but it also ensures
> +  // that the Activity Monitor will not label the content process as "Not
> +  // responding" because it's not running a native event loop. See bug 1384336.
> +  CGSSetDenyWindowServerConnections(true);
> +  CGSShutdownServerConnections();

So this also runs if dom.ipc.useNativeEventProcessing.content is true? Does that work?
Attachment #8895537 - Flags: review?(mstange) → review+
Pushed by wmccloskey@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4786ec6700d0
Fix to MainThreadInvoker to avoid deadlocks (r=aklotz)
https://hg.mozilla.org/integration/mozilla-inbound/rev/01fd73e4b8b8
Stop using OS-level event loop in content process (r=mstange)
https://hg.mozilla.org/mozilla-central/rev/4786ec6700d0
https://hg.mozilla.org/mozilla-central/rev/01fd73e4b8b8
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Duplicate of this bug: 1306663
According to the kick-off meeting with Andrew, stage 2 from https://docs.google.com/document/d/1KYuJu23tzweO4WOceF5Fhle6SQ-PXnfV6NfrGf4kiBM is to be tested by Softvision - to which bug 1384336 refers to.
Our current understanding is that we need to cover testing Firefox with third-party software (accessibility, antivirus).
So, we are planning to test as follows:

Environment:
- Windows : Win 7 x32, Win 7x 64, Win 8.1 x32, Win 8.1 x64, Win 10 x32, Win 10 x64
- Linux : Ubuntu 16.04 x64 , Ubuntu 15.04 x32
- Mac OS X 10.12

Test Scenarios: running FF in the above environments with:
-Mac+ Ubuntu - native firewalls, native screen readers
-Windows: KAV 10.3.0, NVDA 

Andrew, Bill could you list that pref. that is mentioned in comment 11 if that's still applicable?
Could you also please advise if there are any variations to the above scenarios or any other scenarios or setups that would be relevant to test against? Is there anything related to content processes that would need to be tested during this test run?
Flags: needinfo?(overholt)
Flags: needinfo?(WJMII)
I'll let Bill answer.
Flags: needinfo?(wmccloskey)
Flags: needinfo?(overholt)
Flags: needinfo?(WJMII)
Depends on: 1395187
The pref is set by default, so it's fine to test a normal Nightly here. The testing setup sounds good, although the more AV/accessibility software we can test with, the better. I think that skipping Windows 8 should be fine. Probably no testing on Linux is needed at all.
Flags: needinfo?(wmccloskey)
I'm going to be disabling this for 57 in bug 1384336. I'll try to sort out these regressions and hopefully we can re-enable in 58.
(In reply to Bill McCloskey (:billm) from comment #31)
> I'm going to be disabling this for 57 in bug 1384336. I'll try to sort out
> these regressions and hopefully we can re-enable in 58.

That should have been bug 1397956.
Blocks: 1423628
Depends on: 1486971
This got re-enabled in bug 1426100.
Depends on: 1492284
You need to log in before you can comment on or make changes to this bug.