Gaia homescreen doesn't repaint while panning around when loaded out of process

RESOLVED FIXED

Status

()

Core
General
P1
normal
RESOLVED FIXED
6 years ago
6 years ago

People

(Reporter: cjones, Assigned: cjones)

Tracking

Trunk
ARM
Gonk (Firefox OS)
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-basecamp:+)

Details

If the homescreen is loaded in <iframe ... remote="true">, its content process crashes.  Not sure why yet.

Updated

6 years ago
blocking-basecamp: --- → +

Comment 1

6 years ago
How are we doing here?

Updated

6 years ago
Priority: -- → P1
Will see if fixing this ameliorates bug 779968.
Assignee: nobody → jones.chris.g
Blocks: 779968
The crash is gone (maybe I was testing the wrong thing?), but there are some other bugs.
Well, the good news (although possibly bad news) is that the layers txn counter and composition fps stay around 60.  Need to retest with the stop-excessive-recomposites patches.

Bad news is that, if I pan around really quickly on the homescreen, the content just stops moving.  It doesn't start moving again until I stop panning around.  We're not repainting unnecessarily or anything silly like that.  First suspicion is that we're not marking the homescreen active/visible properly and it stays at -10 nice or something.
Yay, with bug 778036 the not-repainting goes away.

But the fps is still the same :(.
No longer blocks: 779968
Depends on: 778036
Since I'm about 80% sure I was originally testing with the system app, I'll go ahead and morph this into the other blocker that it is, and leave open until bug 778036 lands so that we can verify the fix still works.
Summary: Crash when loading gaia homescreen app out of process → Gaia homescreen doesn't repaint while panning around when loaded out of process
Duplicate of this bug: 783076
I wasn't seeing this with patches from bug 778036, but now I am again, even after it landed :(.
The things we can try here, in parallel, are
 1. revert m-c to 8/15-era revision, re-apply bug bug 778036 and re-confirm that perf is good.  Bisect from there to tip to see what regressed this.
 2. check that the system app isn't invalidating or processing any new messages that cause the b2g process to do nontrivial work.  Some changes for async pan/zoom might have caused this.
 3. drop the process priority of the b2g process to the same as foreground content processes, and instead just increase the priority of the compositor thread

I'll look at (1).  Anyone want to volunteer to poke at (2) or (3)?
Can't reproduce my results on https://github.com/mozilla-b2g/gaia/commit/d5f5e024109ce8b788e052fdef8d4f2c7689f4b2 with this patch applied

diff --git a/apps/system/js/bootstrap.js b/apps/system/js/bootstrap.js
index 1cd9052..8dfe405 100644
--- a/apps/system/js/bootstrap.js
+++ b/apps/system/js/bootstrap.js
@@ -51,6 +51,7 @@ function startup() {
   home.src = src;
   home.setAttribute('mozapp', src.replace('index.html', 'manifest.webapp'));
   home.setAttribute('mozbrowser', 'true');
+  home.setAttribute('remote', 'true');
   home.dataset.zIndexLevel = 'homescreen';
   var windows = document.getElementById('windows');
   windows.parentNode.insertBefore(home, windows);

i.e., the pre-homescreen-as-app change to gaia.

So back to the drawing board.  Grrrrr.
Transaction counter is still showing that we're not overdrawing or anything like that.  The entire gfx pipeline just shuts down while panning around.

The cuddle animations are still running doubly-interpolated (content main thread along with compositor thread) at 60fps, which is odd ... the content process can repaint at 60fps alongside the compositor resampling at 60fps, but it can't repaint at *all* while the main thread is routing input events.
Making the master-process priority equal to the foreground priority (and also equal to background, to eliminate possibility of bugs) doesn't change anything.  Even prioritizing the foreground process above the master process doesn't help.  Aieeeee!
(the setpriority() calls are correct and aren't returning error codes)
Thinking back, I might have been testing on the Nexus S.  A kernel- or event-hub-specific problem might be contributing to this.  Will retest on nxs in a bit to double check.
In the first 1 second of fast-panning, we process 39 events in the content process, and they're generally received 10-30ms apart.  There's a 200ms outlier gap though, early in the sequence.  This seems to be when we prerender the adjacent screen.

We're not using input-event compression, which is a known bug, and this is causing the queue to back up a bit.

But the event handling everything on the content-process side looks ok ...
Alright, we're not sending PLayers:Update messages at all while panning.  Something is going badly wrong.
I'm seeing 2-3 PuppetWidget::Invalidate/Paint total during panning, but per discussion with mattwoodrow I think this is expected.

Now to see if we're getting to nsViewManager::ProcessPendingUpdatesForView/nsRefreshDriver::Notify ...
nsRefreshDriver::Notify isn't being called.  Matt, any ideas about that?

Checking now whether ProcessPendingUpdatesForView is being called separately ...
No, we don't hit ProcessPendingUpdatesForView either.
The symptom here is that while panning around in OOP web content (the gaia homescreen), we don't repaint at all.

What's happening is
 - content receives touch events
 - it sets the -moz-transform of some elements based on the touch events
 - the restyles required by those changes are being posted
 - we're scheduling a view manager flush
 - but there are no refresh observers registered with that refresh driver, so we never hit nsRefreshDriver::Notify() and never repaint

In the steady state of no animation or image refresh observers, what's supposed to observe refresh driver ticks?  For some reason, whatever that is seems not to be running in content processes.
When style changes happen, nsCSSFrameConstructor::PostRestyleEventInternal calls mPresShell->GetPresContext()->RefreshDriver()->AddStyleFlushObserver(mPresShell). That's what triggers a new refresh driver tick to flush styles on the presshell.

Comment 23

6 years ago
I changed this code in my original OMTA patch to no longer request that refresh driver tick. Maybe thats what breaks here? dz probably kept the basic design.
This is all restyling from explicit -moz-transform changes, no animations.  It works in-process.
Animations work fine OOP too, 60fps (with sidecar interpolation).
Here's approximately what's happening (with lots of logging, which is making things about 4x more costly)
 - the homescreen starts on the page with just the search icon
 - the content process is processing touchmove quickly, keeping up with dispatch from the parent
   * because we're on the page with just the icon, restyles and invalidates are cheap
   * content process is sleeping between events
 - we process the first refresh driver tick, which repaints the second page
 - the second page is more complex (lots more icons).  repainting is relatively expensive
   * while repainting, the event queue continues filling up
 - we being flushing event queue
   * restyles take longer now because we're processing both pages (?)
   * content process no longer sleeping between events
 - for some reason, we don't process more refresh driver ticks

So there are three bad things going on here
 1. not compressing input events.  This is the worst.
 2. restyles taking longer than they should.  Exacerbates (1)
 3. not process refresh-driver timer even though it's fired.  Related to (1), but possibly indicates event priority problem.

bent, do you have any idea why (3) might be occurring?
Depends on: 774988, 779968

Comment 27

6 years ago
Should we send timer events from the parent to avoid priority issues?
That will just create different performance problems.
Based on some unscientific tests, I think we'll be able free up 8-10MB of locked memory by moving the homescreen OOP.  And that's not counting the system wallpaper, which the system app is still rendering in my test.  That might free up another MB.
The panning fps on OOP homescreen has gone down by about 10fps since last I checked.  Grrrrr.  Have some more work to do to move it OOP.
The work here is done.  Optimizations remain.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
No longer depends on: 779968
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.