Closed Bug 1350634 Opened 3 years ago Closed 3 years ago

The PCompositorBridge::Msg_PLayerTransactionConstructor sync IPC is extremely inefficient

Categories

(Core :: Graphics: Layers, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: ehsan, Assigned: dvander)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [gfx-noted][qf:investigate])

Attachments

(1 file)

According to our telemetry data, the median time of this is 16.84ms, which is already above a frame budget.  :(

Bas, can you please find an owner?  Thanks!
Flags: needinfo?
Flags: needinfo? → needinfo?(bas)
I captured this on a profile <https://perfht.ml/2nq1R43>, it's super easy to reproduce at least on OSX on a new profile when opening a new window.  In this profile this took 63 ms.  This, plus bug 1348361 together are responsible for nsFrameLoader::TryRemoteBrowser() janking the UI thread for 121ms.
Blocks: UIJank
See Also: → 1348361
16.84 is -just- over 16.66. I suspect this may have to do with some blocking presentation call. This is probably hard to make async, but I'll let David comment on that.
Flags: needinfo?(bas) → needinfo?(dvander)
The profile in comment #1 looks Mac-specific, all the time is in GLContext::InitWithPrefixImpl, doing library loading stuff. That is also specific to opening a window, the profile would look different for a new tab.

Though unfortunately Bas is right, if we start a composite as we open a tab, we'll block until that composite finishes. Making this not synchronous is hard, though maybe we could delay blocking until the tab first needs to start painting.
Flags: needinfo?(dvander)
Whiteboard: [qf]
Whiteboard: [qf] → [qf:p1]
Not an easy problem, not likely to get attention in the short term.  May want to reconsider the priority.
Whiteboard: [qf:p1] → [gfx-noted][qf]
Windows profile/performance problem would be appreciated.
I might have to retract comment #3 - Kats did a clever refactoring in bug 1350638 that I think could be used to make this async. We would still have to send some kind of sync message in the open-window case though.

We'd stick the TextureFactoryIdentifier and layersid in the same place we propagate CompositorOptions, and drop those values from the PLayerTransaction constructor. For the top-level window case, we'd add a new sync message to PLayerTransaction, "IdentifyLayers" or something, so the widget could still get that info.
FWIW this code is on the path of opening a new window which is one of the key interactions that the UX team has asked us to focus on optimizing for.
We should see what is happening during the 60ms the compositor is working during this call. If we make this call async, we'll free up the main thread, but it will likely still take 60ms to show the user something and thus the work would be less valuable than actually making showing the user something faster.
Whiteboard: [gfx-noted][qf] → [gfx-noted][qf:investigate]
I did some Telemetry investigation on this. For sessions with the GPU process, this message appears in the content IPC_SYNC_MAIN_LATENCY_MS histogram 16% of the time. For sessions without the GPU process, it appears 84% of the time. It will be interesting to see if that holds after the GPU process rollout continues to software users.
Attached patch patchSplinter Review
The reason PLayerTransactionParent's constructor is synchronous is to acquire the TextureFactoryIdentifier. This is already sent via InitRenderingState, which is ultimately acquired by the RenderFrameParent. In theory we can just use this instead, and drop the sync requirement.

For top-level widgets, we need a new sync message since to get the T-I-F.

There are two potential problems with this. One is that, it is possible (albeit super rare) for us to associate a layers ID with a top-level compositor, create a TabParent, and then shut down the TabParent while the TabChild is trying to initialize rendering. The sync message let us detect whether the connection was severed. However, I don't think this check was actually meaningful. There is no guaranteed ordering between unmapping the layer id and the LayerTransactionParent constructor, so we could have returned "success" and then had a dead association anyway.

Another potential problem is that the TextureFactoryIdentifier might change in between associating to a compositor and the PLayerTransaction constructor. I think that's pre-existing though. The AdoptChild message does not acquire a new T-I-F, so if you drag+drop a tab across compositor types, that information would have been incorrect already.

Seems to be okay on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=41ddd458651e9fcd30f4055076b75c452cc7927b
Assignee: nobody → dvander
Status: NEW → ASSIGNED
Attachment #8860245 - Flags: review?(matt.woodrow)
Attachment #8860245 - Flags: review?(matt.woodrow) → review+
Comment on attachment 8860245 [details] [diff] [review]
patch

Review of attachment 8860245 [details] [diff] [review]:
-----------------------------------------------------------------

r+ but we should keep an eye out for spikes in crashes or any new crash signatures that might fall out from this.

::: dom/ipc/TabChild.cpp
@@ +2713,3 @@
>        PLayerTransactionChild* shadowManager =
> +          compositorChild->SendPLayerTransactionConstructor(backends, aLayersId);
> +      if (shadowManager) {

So I'm a little concerned about this codepath, because it undoes the effect of bug 1322633 - that is, if the compositor bridge parent fails and sends back the "dummy" PLayerTransaction, we are going to go ahead and call InitAPZState() anyway.

That being said, the crash that bug 1322633 was intended to fix is still happening (e.g. bug 1354960, there are more) so maybe it's ok that we undoing it, since we need a better fix for that anyway.
Attachment #8860245 - Flags: review?(bugmail) → review+
Comment on attachment 8860245 [details] [diff] [review]
patch

Bill, r? for the sync message change.
Attachment #8860245 - Flags: review?(wmccloskey)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #11)
> 
> So I'm a little concerned about this codepath, because it undoes the effect
> of bug 1322633 - that is, if the compositor bridge parent fails and sends
> back the "dummy" PLayerTransaction, we are going to go ahead and call
> InitAPZState() anyway.
> 
> That being said, the crash that bug 1322633 was intended to fix is still
> happening (e.g. bug 1354960, there are more) so maybe it's ok that we
> undoing it, since we need a better fix for that anyway.

Okay, I'm half expecting this to blow up for any of a million reasons anyway :) I'm a bit worried about potentially leaving a dangling LayerTransaction there. But hopefully if problems come up they are not too difficult to work around.
Attachment #8860245 - Flags: review?(wmccloskey) → review+
Depends on: 1360171
https://perfht.ml/2poYI68 is a profile where this message is taking 600ms on our reference hardware.
Unfortunately that looks like a top-level widget which this bug won't solve. It looks like the GPU process isn't ready yet.

Looking at the timeline there, maybe we should try starting the GPU process much earlier in XPCOM init.
(In reply to David Anderson [:dvander] from comment #15)
> Unfortunately that looks like a top-level widget which this bug won't solve.
> It looks like the GPU process isn't ready yet.
> 
> Looking at the timeline there, maybe we should try starting the GPU process
> much earlier in XPCOM init.

Scary, but probably necessary to at least try and see how the numbers change.

Separate "start" and "initialize" for the GPU process?  In other words, just create it, but don't have it do anything until we send it another message with the information we'd need.

Otherwise we'd have to at least wait for the preferences system to be initialized before we can start it.
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f7685ecd789d
Make PLayerTransaction's constructor async. (bug 1350634, ipc_r=billm, r=mattwoodrow,kats)
(In reply to Milan Sreckovic [:milan] from comment #16)
> (In reply to David Anderson [:dvander] from comment #15)
> > Unfortunately that looks like a top-level widget which this bug won't solve.
> > It looks like the GPU process isn't ready yet.
> > 
> > Looking at the timeline there, maybe we should try starting the GPU process
> > much earlier in XPCOM init.
> 
> Scary, but probably necessary to at least try and see how the numbers change.
> 
> Separate "start" and "initialize" for the GPU process?  In other words, just
> create it, but don't have it do anything until we send it another message
> with the information we'd need.
> 
> Otherwise we'd have to at least wait for the preferences system to be
> initialized before we can start it.

Yeah, that is exactly what I'm thinking. Ideally we should be able to launch the GPU process and let it start XPCOM, but maybe just launching it at all is a good place to be.

I'll do some investigation in bug 1360171.
Backout by ihsiao@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8b70deabfe10
Backed out changeset f7685ecd789d for marionette crashes
FYI, the reason for the backout were crashes. But those are already known for a while and happen intermittently. See bug 1338422 and others under see also for details.
That assert looks maybe bogus to me, I don't think the checks that precede it are enough. We clear |LayerTreeState::mParent| in StopAndClearResources, but rely on the RenderFrameParent::ActorDestroy to actually unmap the layer tree ID. So we could have a mapped ID, but have lost our compositor, if for example the window is destroyed while a tab is loading.

To verify this I put a sleep(30) in InitAPZState in the content process, opened a few windows and then closed them all. The UI process crashed.
Flags: needinfo?(dvander)
Filed bug 1360478 separately since it will be a good candidate for uplift.
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/867fcd3e181d
Make PLayerTransaction's constructor async. (bug 1350634, ipc_r=billm, r=mattwoodrow, r=kats)
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/084bebaffd26 for leaking in sessionstore's browser-chrome tests, asan: https://treeherder.mozilla.org/logviewer.html#?job_id=95582368&repo=mozilla-inbound, windows: https://treeherder.mozilla.org/logviewer.html#?job_id=95586172&repo=mozilla-inbound (wouldn't be surprised if it also would have leaked on Mac and Linux, but we couldn't be arsed to actually run the chunks of browser-chrome with the sessionstore tests in them on anything but asan and windows).
Perhaps the code here:

     if (!success) {
       NS_WARNING("failed to re-allocate layer transaction");
       return;
     }

should be cleaning this up?
(In reply to Bill Gianopoulos [:WG9s] from comment #26)
> Perhaps the code here:
> 
>      if (!success) {
>        NS_WARNING("failed to re-allocate layer transaction");
>        return;
>      }
> 
> should be cleaning this up?

Hrm, I don't think that code is related.

It's possible that making this non-sync has exposed timing issues that make the fix for bug 1360478 leaky. bug 1337062 had a similar issue. But I'm not yet sure how this leak could happen. CPCBP blocks the compositor thread from shutting down until all of its actors have been cleaned up, and there's nothing fancy about this actor. It's allocated and deleted unconditionally.
Blocks: 1360171
No longer depends on: 1360171
Oh, I know what's happening. Making this message asynchronous makes it much more likely (still rare, but much more likely) that we'll actually hit the scenario in bug 1360478, since the tab can issue the PAPZCTreeManager message a split second faster. That bug has a leak: we never clear mFlushObserver on the APZCTM, so it effectively retains a reference to itself. It's just extremely rare we'd see it on TreeHerder without this patch.

Kats, here's a few routes we could go to fix this:
 (1) Allow APZCTreeManagerParent to have a null APZCTreeManager, and null check in each callback.
 (2) Call ClearTree immediately after allocating the temp APZCTM.
 (3) Have some kind of alternate constructor that doesn't create the observer.

Any preference?
Flags: needinfo?(bugmail)
(In reply to David Anderson [:dvander] from comment #28)
> Kats, here's a few routes we could go to fix this:
>  (1) Allow APZCTreeManagerParent to have a null APZCTreeManager, and null
> check in each callback.
>  (2) Call ClearTree immediately after allocating the temp APZCTM.
>  (3) Have some kind of alternate constructor that doesn't create the
> observer.
> 
> Any preference?

My preference is for (2). Second choice would be (3), but I'd like to stay away from (1) as it incurs an overhead for the common case just to handle this edge case behaviour.
Flags: needinfo?(bugmail)
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8d5920dc4472
Make PLayerTransaction's constructor async. (bug 1350634, ipc_r=billm, r=mattwoodrow, r=kats)
https://hg.mozilla.org/mozilla-central/rev/8d5920dc4472
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Depends on: 1363306
You need to log in before you can comment on or make changes to this bug.