Open Bug 1305130 Opened 8 years ago Updated 2 years ago

Allow painting for APZ when JS is running

Categories

(Core :: Panning and Zooming, defect, P3)

52 Branch
defect

Tracking

()

People

(Reporter: billm, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: [gfx-noted])

Attachments

(1 file)

This is a follow-on to bug 1279086. This idea is to call PresShell::Paint whenever APZ scrolls. We want to update the displayport without updating the scroll position. After talking to kats, the basic outline would be:

1. When this happens:
http://searchfox.org/mozilla-central/source/gfx/layers/ipc/RemoteContentController.cpp#47
2. We want to send a message to the child using the hang monitor protocol.
3. The child would call UpdateDisplayPortMarginsFromPendingMessages(). 
4. Then it would check if the new tile-aligned displayport is different from the old one
5. and paint if it is.

This seems pretty easy to test, at least. Initially I won't do step 4.
Component: Graphics → Panning and Zooming
Keywords: perf
OS: Unspecified → All
Priority: -- → P3
Hardware: Unspecified → All
Whiteboard: [gfx-noted]
Version: unspecified → 52 Branch
Attached patch WIP patchSplinter Review
This is a proof of concept. The code is pretty hacky, with lots of duplication. But it seems to work (once the patch in bug 1305165 is applied). There are a couple of issues:

1. We really should be checking if the new displayport differs much from the existing one. If it doesn't, we should skip the paint. This is especially important because we might have received the RequestContentRepaint message on the main thread and processed it, then later gotten the ForceAPZPaint inside some random JS code that ran later. So we need much better checking that the paint is necessary. kats, you said we already have some code like this somewhere else. Can you give advice how I could use it?

2. I can scroll around just fine, even outside the displayport that was set before the JS started. However, when the JS finishes, I often see a flash of white for about 1/2 a second. Looking for ideas. I can debug if need be, but I though I'd ask.

3. David, I think this is going to break with the GPU process. Maybe we can talk at some point about how to make it work? One option is to bounce the ForceAPZPaint message through the UI process. Another option is to establish something like the hang monitor channel from the GPU process to each content process. The latter seems like maybe overkill though.
Attachment #8794420 - Flags: feedback?(dvander)
Attachment #8794420 - Flags: feedback?(bugmail)
From IRC:


<botond> billm: to answer point #1 from bug 1305130 comment 1: i think the code kats is referring to is here: http://searchfox.org/mozilla-central/source/layout/generic/nsGfxScrollFrame.cpp#2857
<firebot> https://bugzil.la/1305130 — NEW, wmccloskey@mozilla.com — Allow painting for APZ when JS is running
<botond> billm: specifically the |displayPort.IsEqualEdges(oldDisplayPort)| check
<botond> billm: where those display port values were obtained using nsLayoutUtils::GetHighResolutionDisplayPort, which returns a tile-aligned displayport
<billm> botond: oh, thanks!
<botond> billm: correction: those display port values were obtained using nsLayoutUtils::GetHighResolutionDisplayPort, and then translated by the negation of the scroll offset
<botond> billm: the reason for the translation is that GetHighResolutionDisplayPort returns the display port relative to the scroll offset
<botond> billm: whereas we want to see if the display port has changed relative to the origin of the scrollable element
<billm> botond: what is paint skipping?
<botond> er, the origin of the scrolled content
<botond> billm: so, scrolling usually schedules a paint
<botond> billm: but if we figure out that that paint would just re-use the same displayport we've already painted, we don't bother scheduling the paint, so that we can skip the display list building etc. steps as well
Also, this test is failing:
 1557 INFO TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_group_touchevents.html | helper_scrollto_tap.html?true | Document element didn't get painted - got true, expected false 

https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb2b27ed63e8&selectedJob=27992872
Comment on attachment 8794420 [details] [diff] [review]
WIP patch

Review of attachment 8794420 [details] [diff] [review]:
-----------------------------------------------------------------

Generally looks good. See comments below

::: dom/ipc/ProcessHangMonitor.cpp
@@ +407,5 @@
> +    mForcePaintTab = aTabId;
> +    mForcePaintEpoch = 0;
> +  }
> +
> +  JS_RequestInterruptCallback(mContext);

Does this have any sort of built-in delay, or does it interrupt the JS right away? The reason I'm concerned is that there's a tradeoff here - if we delay too much we will run into checkerboarding, but if we are constantly forcing paints it will slow down the main thread. This can be bad because it will cause the main-thread scroll position updates to be delayed as well, and so things like scroll-linked effects (where JS listens for scroll position changes and updates some animation/effect to stay in sync) will be even more laggy than they already are with APZ.

::: dom/ipc/TabChild.cpp
@@ +3308,5 @@
> +    // It's for a tab switch.
> +    RecvSetDocShellIsActive(true, false, aLayerObserverEpoch);
> +  } else {
> +    // It's for APZ.
> +    nsLayoutUtils::UpdateDisplayPortMarginsFromPendingMessages();

So, assuming we have the scrollId here (see my comment in AsyncPanZoomController.cpp) I think we need to do something like this:

nsIContent* content = nsLayoutUtils::FindContentFor(aScrollId);
if (!content) return;
nsRect oldDisplayPort;
if (!nsLayoutUtils::GetHighResolutionDisplayPort(content, &oldDisplayPort)) return;
nsLayoutUtils::UpdateDisplayPortMarginsFromPendingMessages();
nsRect newDisplayPort;
if (!nsLayoutUtils::GetHighResolutionDisplayPort(content, &newDisplayPort)) return;
if (oldDisplayPort.IsEqualEdges(newDisplayPort)) return;

// do paint if we get here

::: gfx/layers/apz/src/AsyncPanZoomController.cpp
@@ +2928,5 @@
> +  uint64_t layersId = mLayersId;
> +  NS_DispatchToMainThread(NS_NewRunnableFunction([layersId]() {
> +        if (dom::TabParent* tp = dom::TabParent::GetTabParentFromLayersId(layersId)) {
> +          dom::ContentParent* cp = tp->Manager()->AsContentParent();
> +          cp->ForceAPZPaint(tp);

I'd prefer to have this code live in RemoteContentController::RequestContentRepaint. In the GPU process case if you want to bounce it through the UI process we have similar code in the RemoteContentController::HandleTap function that you can crib from (although some of the dispatch-to-another-thread might be different).

Also, this should be wrapped in a pref so we can disable it if we need to. Maybe something like apz.interrupt_js.enabled.

Finally, I think we should also pass the scrollId through this ForceAPZPaint function. The reason is that in TabChild.cpp, where we would want to do the "has the displayport really moved" check, we want to know *which* displayport to be checking. Since there are displayports for all actively scrolled elements, we don't want to be checking all of them, and having the scrollId of the thing that's triggering the repaints would help us know which displayport we should check. The scrollId is available here in aFrameMetrics.GetScrollId().
Attachment #8794420 - Flags: feedback?(bugmail) → feedback+
(In reply to Bill McCloskey (:billm) from comment #1)
> 1. We really should be checking if the new displayport differs much from the
> existing one. If it doesn't, we should skip the paint. This is especially
> important because we might have received the RequestContentRepaint message
> on the main thread and processed it, then later gotten the ForceAPZPaint
> inside some random JS code that ran later. So we need much better checking
> that the paint is necessary. kats, you said we already have some code like
> this somewhere else. Can you give advice how I could use it?

As botond mentioned above there is code that does this in nsGfxScrollFrame, and also in the nsLayoutUtils::SetDisplayPortMargins function. I outlined the code that should do the job in my previous comment.

> 2. I can scroll around just fine, even outside the displayport that was set
> before the JS started. However, when the JS finishes, I often see a flash of
> white for about 1/2 a second. Looking for ideas. I can debug if need be, but
> I though I'd ask.

Hm, I suspect what's happening here is that while the JS is running we are painting "ahead" (i.e. painting the latest stuff APZ is asking us to paint) but then when the JS stops maybe we go back and paint old things that are still in the queue? In theory that shouldn't happen because of the existing peek-messages code [1] but maybe that's not working as intended. One thing that might help debug this is looking at the checkerboard recording from about:checkerboard. It's not the most intuitive debugging tool but you can step through changes in APZ-side sequence of events, and will probably see the painted displayport jump back to some stale value and cause the checkerboard.
(In reply to Bill McCloskey (:billm) from comment #3)
> Also, this test is failing:
>  1557 INFO TEST-UNEXPECTED-FAIL |
> gfx/layers/apz/test/mochitest/test_group_touchevents.html |
> helper_scrollto_tap.html?true | Document element didn't get painted - got
> true, expected false 

I guess that makes sense. The test enabled paint-skipping and makes sure that an element doesn't get painted in a particular scenario. But your patch adds another codepath that triggers a paint, so the element does get painted in that scenario. The displayport-check code I suggested should take care of this by skipping the paint in this scenario.
There were two other problems with the try push:
1. One is somehow related to fullscreen mode. Hopefully that won't be too hard to fix. At worst we could disable this in fullscreen mode.

2. We're asserting here:
http://searchfox.org/mozilla-central/rev/d1276b5b84e6cf7991c8e640b5e0ffffd54575a6/layout/generic/nsTextFrame.cpp#2864

I guess we're getting a new text frame and trying to paint it without having reflowed it, which this code doesn't like. I wonder if this is an unusual case or if it's generally bad to paint before reflowing something? I'm also a bit concerned because bug 1279086 probably has the same issue. We're just less likely to hit it.

Matt, do you have any ideas here?
Flags: needinfo?(matt.woodrow)
(In reply to Bill McCloskey (:billm) from comment #7)
> There were two other problems with the try push:
> 1. One is somehow related to fullscreen mode. Hopefully that won't be too
> hard to fix. At worst we could disable this in fullscreen mode.
> 
> 2. We're asserting here:
> http://searchfox.org/mozilla-central/rev/
> d1276b5b84e6cf7991c8e640b5e0ffffd54575a6/layout/generic/nsTextFrame.cpp#2864
> 
> I guess we're getting a new text frame and trying to paint it without having
> reflowed it, which this code doesn't like. I wonder if this is an unusual
> case or if it's generally bad to paint before reflowing something? I'm also
> a bit concerned because bug 1279086 probably has the same issue. We're just
> less likely to hit it.

This case sounds pretty bad.  This invariant is relied on for justified text (see bug 230555) perhaps among other things.  :(  Jonathan may know about this too.
Flags: needinfo?(jfkthame)
(In reply to Bill McCloskey (:billm) from comment #7)
>  wonder if this is an unusual
> case or if it's generally bad to paint before reflowing something?

It sounds "generally bad" to me. A frame that has never been reflowed can't be expected to paint correctly: not just its own immediate content, but its children won't have been placed properly, and its parent may not know its correct size, so the entire layout we're trying to paint may be broken.

The sequence "reflow, then paint" is a pretty firm requirement, I think.
Flags: needinfo?(jfkthame)
I agree, trying to paint frames that haven't been reflowed sounds dangerous.

Can we check the refresh driver to see if a view manager flush has been scheduled?

If it has, then something in the document has changed, and we need to do a full flush and should probably wait for the refresh driver (or we could look into doing that ourselves).
If it hasn't, then the document is unchanged from last time we painted and we can safely paint again.
Flags: needinfo?(matt.woodrow)
These paints are expected to be "best effort". We're already paint in the middle of JS running, so the page could be in the middle of some sort of update. I think we want to do something that:
1. never crashes
2. always paints something
3. minimizes weird artifacts.

Once the content process has "caught up" and returned to processing events, we'll be able to do a proper reflow and paint.

Matt's proposal will always be correct, but there probably will be quite a few cases where we can't paint. Could we instead skip painting frames that have never been reflowed? It seems like the NS_FRAME_FIRST_REFLOW flag that we're asserting on it would be perfect for checking this.
Is this problem restricted to first reflows though?  Subsequent reflows seem to have similar issues to never reflown before frames.

As an alternative, would it make sense to skip painting such frames even if it's not the first reflow that we're waiting for?

(Relatedly, are other frame types also affected by this issue?  Skipping painting text frames seems kind of defensible since at least in the case of text using a web font, we may paint a version before the font is available...)
I suspect that trying to paint without flushing style and reflowing is likely to be too risky and could easily give us really broken drawing.

I believe that trying to draw frames other than text that need a reflow but haven't yet had it is too likely to result in massively incorrect drawing. We could skip all frames that need a reflow, but it could easily be the root frame, so we'd need some sort of heuristic to determine when the frame tree is too damaged for painting to be able to continue in a sane way.

Can we instead make sure we do all delayed changes before we paint?

This will at least give us accurate drawing of the changes made by content so far, which seems a lot safer to me.

Another option would be to instead try to guarantee that no changes get made to the frame tree/style data since we last painted, so we know that we're just painting a different scrolled area and not including an partial content changes.

I can think of two ways to achieve this, but they're both significantly more involved than what you're doing now.

* Make sure all changes to anything that might be used by painting are deferred, and then only allow this optimization to run if nothing has forced some/all of these to flush (usually content requesting computed layout data).

* Split painting into two phases, recording and rasterizing. Make the record phase build display lists and record all drawing commands for an area much larger than the current display port and have the rasterize phase restrict actual drawing to the real display port area. That way we can replay the rasterize phase a second time and draw new areas without reading any data from the frame tree/style.

Adding ni?dbaron since I'm sure he'll be interested in this too.
Flags: needinfo?(dbaron)
To be clear, we have the invariant that all layout/style changes are completely flushed before we paint. It's hard to know exactly what code is depending on this (way more code than I fancy auditing), but removing it scares me.
I talked to Matt a bit more about this. We can't do reflow during these special paints since reflow can cause JS to run and we're forbidden to run JS at these times (since JS is already running).

I think it makes sense to consider simple mitigations like not painting frames that have never been reflowed. It's still possible that things will look bad even if we do this, but the goal is just to reduce the probability.

Since there are a lot of unknowns here, we discussed some testing strategies. First, we can force these special paints to happen a lot more often and then push to try. That will help to expose any crashing issues, which would be very serious.

Second, we can ask our QA team to do A/B tests. We would introduce JS jank in the content process and then ask people if the experience is subjectively better with the special paints turned on (where they might see weird paints) or off (where the jank would be more perceptible). We'll ask them to visit a variety of web sites to get a feel for what sorts of brokenness can happen. I filed bug 1307674 for this.
I've been trying to do this similar to HandleTap as suggested. For the GPU process, it looks like I'm supposed to call GetApzcTreeManagerParentForRoot and then route the ForcePaint message through that protocol. But if there's no GPU process, then GetApzcTreeManagerParentForRoot returns null and I guess I'm just supposed to dispatch to the main thread directly. It would be a lot nicer if GetApzcTreeManagerParentForRoot always returned something that I could use. David, is there a reason that the chrome process doesn't have an APZCTreeManagerChild?
Flags: needinfo?(dvander)
Dropping stale needinfos.

(In reply to Bill McCloskey (:billm) from comment #16)
> It would be a lot nicer if 
> GetApzcTreeManagerParentForRoot always returned something that I could use.

Looking at the implementation of CompositorBridgeParent::GetApzcTreeManagerParentForRoot, if you pass in the layers id for the chrome process, you should still get back the correct APZCTreeManagerParent. The code at [1] is just going to be redundant in that case, because the rootLayersId will be the same as the aContentLayersId passed in.

> is there a reason that the chrome process doesn't have an APZCTreeManagerChild?

I don't think there's any particular reason for this (in the long term). I think the reason it is the way it is now, is that we didn't want to change the in-process codepath, so as to minimize risk. Now that the GPU process is enabled in Nightly and is getting some bake time we have more confidence in the APZCTreeManagerParent/Child interface, and could switch over the InProcessCompositorSession to also use it.

[1] http://searchfox.org/mozilla-central/rev/59bb309e38b10aba63dea8505fb800e99fe821d6/gfx/layers/ipc/CompositorBridgeParent.cpp#1910
Flags: needinfo?(dvander)
Flags: needinfo?(dbaron)
Comment on attachment 8794420 [details] [diff] [review]
WIP patch

I'm going to drop this f? flag so that this bug stops showing up in my triage dashboard. I talked to billm last week and although this is still something we want to pursue, nobody really has cycles to spend on it. Leaving the bug open for now.
Attachment #8794420 - Flags: feedback?(dvander)
I'm not working on this right now. Ehsan indicated he might want to pick it up.
Assignee: wmccloskey → nobody
Assignee: nobody → ehsan
Blocks: QuantumFlow
I asked help from Chris on this.
Assignee: ehsan → cpearce
Assignee: cpearce → nobody
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: