Closed
Bug 1350634
Opened 8 years ago
Closed 8 years ago
The PCompositorBridge::Msg_PLayerTransactionConstructor sync IPC is extremely inefficient
Categories
(Core :: Graphics: Layers, defect)
Core
Graphics: Layers
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox55 | --- | fixed |
People
(Reporter: ehsan.akhgari, Assigned: dvander)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [gfx-noted])
Attachments
(1 file)
According to our telemetry data, the median time of this is 16.84ms, which is already above a frame budget. :(
Bas, can you please find an owner? Thanks!
Flags: needinfo?
| Reporter | ||
Updated•8 years ago
|
Flags: needinfo? → needinfo?(bas)
| Reporter | ||
Comment 1•8 years ago
|
||
I captured this on a profile <https://perfht.ml/2nq1R43>, it's super easy to reproduce at least on OSX on a new profile when opening a new window. In this profile this took 63 ms. This, plus bug 1348361 together are responsible for nsFrameLoader::TryRemoteBrowser() janking the UI thread for 121ms.
Comment 2•8 years ago
|
||
16.84 is -just- over 16.66. I suspect this may have to do with some blocking presentation call. This is probably hard to make async, but I'll let David comment on that.
Flags: needinfo?(bas) → needinfo?(dvander)
| Assignee | ||
Comment 3•8 years ago
|
||
The profile in comment #1 looks Mac-specific, all the time is in GLContext::InitWithPrefixImpl, doing library loading stuff. That is also specific to opening a window, the profile would look different for a new tab.
Though unfortunately Bas is right, if we start a composite as we open a tab, we'll block until that composite finishes. Making this not synchronous is hard, though maybe we could delay blocking until the tab first needs to start painting.
Flags: needinfo?(dvander)
Updated•8 years ago
|
Whiteboard: [qf]
Updated•8 years ago
|
Whiteboard: [qf] → [qf:p1]
Not an easy problem, not likely to get attention in the short term. May want to reconsider the priority.
Whiteboard: [qf:p1] → [gfx-noted][qf]
Windows profile/performance problem would be appreciated.
| Assignee | ||
Comment 6•8 years ago
|
||
I might have to retract comment #3 - Kats did a clever refactoring in bug 1350638 that I think could be used to make this async. We would still have to send some kind of sync message in the open-window case though.
We'd stick the TextureFactoryIdentifier and layersid in the same place we propagate CompositorOptions, and drop those values from the PLayerTransaction constructor. For the top-level window case, we'd add a new sync message to PLayerTransaction, "IdentifyLayers" or something, so the widget could still get that info.
| Reporter | ||
Comment 7•8 years ago
|
||
FWIW this code is on the path of opening a new window which is one of the key interactions that the UX team has asked us to focus on optimizing for.
Comment 8•8 years ago
|
||
We should see what is happening during the 60ms the compositor is working during this call. If we make this call async, we'll free up the main thread, but it will likely still take 60ms to show the user something and thus the work would be less valuable than actually making showing the user something faster.
Whiteboard: [gfx-noted][qf] → [gfx-noted][qf:investigate]
| Assignee | ||
Comment 9•8 years ago
|
||
I did some Telemetry investigation on this. For sessions with the GPU process, this message appears in the content IPC_SYNC_MAIN_LATENCY_MS histogram 16% of the time. For sessions without the GPU process, it appears 84% of the time. It will be interesting to see if that holds after the GPU process rollout continues to software users.
| Assignee | ||
Comment 10•8 years ago
|
||
The reason PLayerTransactionParent's constructor is synchronous is to acquire the TextureFactoryIdentifier. This is already sent via InitRenderingState, which is ultimately acquired by the RenderFrameParent. In theory we can just use this instead, and drop the sync requirement.
For top-level widgets, we need a new sync message since to get the T-I-F.
There are two potential problems with this. One is that, it is possible (albeit super rare) for us to associate a layers ID with a top-level compositor, create a TabParent, and then shut down the TabParent while the TabChild is trying to initialize rendering. The sync message let us detect whether the connection was severed. However, I don't think this check was actually meaningful. There is no guaranteed ordering between unmapping the layer id and the LayerTransactionParent constructor, so we could have returned "success" and then had a dead association anyway.
Another potential problem is that the TextureFactoryIdentifier might change in between associating to a compositor and the PLayerTransaction constructor. I think that's pre-existing though. The AdoptChild message does not acquire a new T-I-F, so if you drag+drop a tab across compositor types, that information would have been incorrect already.
Seems to be okay on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=41ddd458651e9fcd30f4055076b75c452cc7927b
| Assignee | ||
Updated•8 years ago
|
Attachment #8860245 -
Flags: review?(bugmail)
Updated•8 years ago
|
Attachment #8860245 -
Flags: review?(matt.woodrow) → review+
Comment 11•8 years ago
|
||
Comment on attachment 8860245 [details] [diff] [review]
patch
Review of attachment 8860245 [details] [diff] [review]:
-----------------------------------------------------------------
r+ but we should keep an eye out for spikes in crashes or any new crash signatures that might fall out from this.
::: dom/ipc/TabChild.cpp
@@ +2713,3 @@
> PLayerTransactionChild* shadowManager =
> + compositorChild->SendPLayerTransactionConstructor(backends, aLayersId);
> + if (shadowManager) {
So I'm a little concerned about this codepath, because it undoes the effect of bug 1322633 - that is, if the compositor bridge parent fails and sends back the "dummy" PLayerTransaction, we are going to go ahead and call InitAPZState() anyway.
That being said, the crash that bug 1322633 was intended to fix is still happening (e.g. bug 1354960, there are more) so maybe it's ok that we undoing it, since we need a better fix for that anyway.
Attachment #8860245 -
Flags: review?(bugmail) → review+
| Assignee | ||
Comment 12•8 years ago
|
||
Comment on attachment 8860245 [details] [diff] [review]
patch
Bill, r? for the sync message change.
Attachment #8860245 -
Flags: review?(wmccloskey)
| Assignee | ||
Comment 13•8 years ago
|
||
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #11)
>
> So I'm a little concerned about this codepath, because it undoes the effect
> of bug 1322633 - that is, if the compositor bridge parent fails and sends
> back the "dummy" PLayerTransaction, we are going to go ahead and call
> InitAPZState() anyway.
>
> That being said, the crash that bug 1322633 was intended to fix is still
> happening (e.g. bug 1354960, there are more) so maybe it's ok that we
> undoing it, since we need a better fix for that anyway.
Okay, I'm half expecting this to blow up for any of a million reasons anyway :) I'm a bit worried about potentially leaving a dangling LayerTransaction there. But hopefully if problems come up they are not too difficult to work around.
Attachment #8860245 -
Flags: review?(wmccloskey) → review+
| Reporter | ||
Comment 14•8 years ago
|
||
https://perfht.ml/2poYI68 is a profile where this message is taking 600ms on our reference hardware.
| Assignee | ||
Comment 15•8 years ago
|
||
Unfortunately that looks like a top-level widget which this bug won't solve. It looks like the GPU process isn't ready yet.
Looking at the timeline there, maybe we should try starting the GPU process much earlier in XPCOM init.
(In reply to David Anderson [:dvander] from comment #15)
> Unfortunately that looks like a top-level widget which this bug won't solve.
> It looks like the GPU process isn't ready yet.
>
> Looking at the timeline there, maybe we should try starting the GPU process
> much earlier in XPCOM init.
Scary, but probably necessary to at least try and see how the numbers change.
Separate "start" and "initialize" for the GPU process? In other words, just create it, but don't have it do anything until we send it another message with the information we'd need.
Otherwise we'd have to at least wait for the preferences system to be initialized before we can start it.
Comment 17•8 years ago
|
||
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f7685ecd789d
Make PLayerTransaction's constructor async. (bug 1350634, ipc_r=billm, r=mattwoodrow,kats)
| Assignee | ||
Comment 18•8 years ago
|
||
(In reply to Milan Sreckovic [:milan] from comment #16)
> (In reply to David Anderson [:dvander] from comment #15)
> > Unfortunately that looks like a top-level widget which this bug won't solve.
> > It looks like the GPU process isn't ready yet.
> >
> > Looking at the timeline there, maybe we should try starting the GPU process
> > much earlier in XPCOM init.
>
> Scary, but probably necessary to at least try and see how the numbers change.
>
> Separate "start" and "initialize" for the GPU process? In other words, just
> create it, but don't have it do anything until we send it another message
> with the information we'd need.
>
> Otherwise we'd have to at least wait for the preferences system to be
> initialized before we can start it.
Yeah, that is exactly what I'm thinking. Ideally we should be able to launch the GPU process and let it start XPCOM, but maybe just launching it at all is a good place to be.
I'll do some investigation in bug 1360171.
Comment 19•8 years ago
|
||
sorry had to back this out for crashes like https://treeherder.mozilla.org/logviewer.html#?job_id=95032464&repo=mozilla-inbound&lineNumber=2201 & https://treeherder.mozilla.org/logviewer.html#?job_id=95010858&repo=mozilla-inbound&lineNumber=1144
Flags: needinfo?(dvander)
Comment 20•8 years ago
|
||
Backout by ihsiao@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8b70deabfe10
Backed out changeset f7685ecd789d for marionette crashes
Comment 21•8 years ago
|
||
FYI, the reason for the backout were crashes. But those are already known for a while and happen intermittently. See bug 1338422 and others under see also for details.
| Assignee | ||
Comment 22•8 years ago
|
||
That assert looks maybe bogus to me, I don't think the checks that precede it are enough. We clear |LayerTreeState::mParent| in StopAndClearResources, but rely on the RenderFrameParent::ActorDestroy to actually unmap the layer tree ID. So we could have a mapped ID, but have lost our compositor, if for example the window is destroyed while a tab is loading.
To verify this I put a sleep(30) in InitAPZState in the content process, opened a few windows and then closed them all. The UI process crashed.
Flags: needinfo?(dvander)
| Assignee | ||
Comment 23•8 years ago
|
||
Filed bug 1360478 separately since it will be a good candidate for uplift.
Comment 24•8 years ago
|
||
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/867fcd3e181d
Make PLayerTransaction's constructor async. (bug 1350634, ipc_r=billm, r=mattwoodrow, r=kats)
Comment 25•8 years ago
|
||
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/084bebaffd26 for leaking in sessionstore's browser-chrome tests, asan: https://treeherder.mozilla.org/logviewer.html#?job_id=95582368&repo=mozilla-inbound, windows: https://treeherder.mozilla.org/logviewer.html#?job_id=95586172&repo=mozilla-inbound (wouldn't be surprised if it also would have leaked on Mac and Linux, but we couldn't be arsed to actually run the chunks of browser-chrome with the sessionstore tests in them on anything but asan and windows).
Comment 26•8 years ago
|
||
Perhaps the code here:
if (!success) {
NS_WARNING("failed to re-allocate layer transaction");
return;
}
should be cleaning this up?
| Assignee | ||
Comment 27•8 years ago
|
||
(In reply to Bill Gianopoulos [:WG9s] from comment #26)
> Perhaps the code here:
>
> if (!success) {
> NS_WARNING("failed to re-allocate layer transaction");
> return;
> }
>
> should be cleaning this up?
Hrm, I don't think that code is related.
It's possible that making this non-sync has exposed timing issues that make the fix for bug 1360478 leaky. bug 1337062 had a similar issue. But I'm not yet sure how this leak could happen. CPCBP blocks the compositor thread from shutting down until all of its actors have been cleaned up, and there's nothing fancy about this actor. It's allocated and deleted unconditionally.
| Assignee | ||
Comment 28•8 years ago
|
||
Oh, I know what's happening. Making this message asynchronous makes it much more likely (still rare, but much more likely) that we'll actually hit the scenario in bug 1360478, since the tab can issue the PAPZCTreeManager message a split second faster. That bug has a leak: we never clear mFlushObserver on the APZCTM, so it effectively retains a reference to itself. It's just extremely rare we'd see it on TreeHerder without this patch.
Kats, here's a few routes we could go to fix this:
(1) Allow APZCTreeManagerParent to have a null APZCTreeManager, and null check in each callback.
(2) Call ClearTree immediately after allocating the temp APZCTM.
(3) Have some kind of alternate constructor that doesn't create the observer.
Any preference?
Flags: needinfo?(bugmail)
Comment 29•8 years ago
|
||
(In reply to David Anderson [:dvander] from comment #28)
> Kats, here's a few routes we could go to fix this:
> (1) Allow APZCTreeManagerParent to have a null APZCTreeManager, and null
> check in each callback.
> (2) Call ClearTree immediately after allocating the temp APZCTM.
> (3) Have some kind of alternate constructor that doesn't create the
> observer.
>
> Any preference?
My preference is for (2). Second choice would be (3), but I'd like to stay away from (1) as it incurs an overhead for the common case just to handle this edge case behaviour.
Flags: needinfo?(bugmail)
Comment 30•8 years ago
|
||
Pushed by danderson@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8d5920dc4472
Make PLayerTransaction's constructor async. (bug 1350634, ipc_r=billm, r=mattwoodrow, r=kats)
Comment 31•8 years ago
|
||
| bugherder | ||
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
status-firefox55:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Updated•3 years ago
|
Performance Impact: --- → ?
Whiteboard: [gfx-noted][qf:investigate] → [gfx-noted]
You need to log in
before you can comment on or make changes to this bug.
Description
•