Open Bug 1334655 Opened 4 years ago Updated 2 years ago
Bridge::Msg _Will Close can block the UI thread when closing a window
See this profile: <https://clptr.io/2jFN1SY> We start by a sync IPC from the parent main thread to the compositor thread for about 500ms and then the compositor thread somehow disappears from the profile for about 10 *seconds* after which things resume both on the compositor and UI thread sides. I have no idea what is happening in this profile at the time when the samples disappear. Markus, do you know?
Priority: -- → P3
Actually we are doing some cleanup in  but maybe we have chance to return early to main thread. Jerry, any idea? http://searchfox.org/mozilla-central/source/gfx/layers/ipc/CompositorBridgeParent.cpp#445
Assignee: nobody → hshih
Hi Markus, This is the interval that we can't see the sample on both main and compositor thread. https://perf-html.io/public/1924d93901fbe4000cc018179115ef67cedea68d/calltree/?range=385.2865_432.0684~389.0925_391.3270&thread=0 The more interesting thing is that we can see the "composite", "layer transaction" and "rasterize" markers, but there is no sample shown in profiler. Are gecko running out of the sampling buffer?
Assignee: hshih → nobody
We pause the profiler just before collecting the profile, and then resume it afterwards. I believe this gap is caused by such a pause. I think Ehsan had the profiler running, collected a profile (which paused the profiler for a bit), interacted with the browser some more, and then collected another profile. The first collection is what caused the gap, the second collection resulted in the profile you're looking at.
Oh, and we still collect markers while the profiler is paused, for some reason. Maybe we should fix that.
Hi Ehsan, Is the PCompositorBridge::Msg_WillClose() blocking problem shown in the profiler? Are we going to fix the Msg_WillClose() problem in this bug?
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #5) > Hi Ehsan, > Is the PCompositorBridge::Msg_WillClose() blocking problem shown in the > profiler? Yeah, all you have to do is search for Msg_WillClose in the profile. > Are we going to fix the Msg_WillClose() problem in this bug? I don't understand your question, I'm afraid. Isn't that what I filed the bug for? The issue about lack of the profile samples in that 10 second range is not relevant here, it just makes it hard to tell what happened after Msg_WillClose.
The most of time used in Msg_WillClose() is at gl shader deletion. It's about 400ms. Jeff, any idea for this situation?  gl shader deletion: https://dxr.mozilla.org/mozilla-central/rev/d0462b0948e0b1147dcce615bddcc46379bdadb2/gfx/layers/opengl/CompositorOGL.cpp#190  profiler: https://perf-html.io/public/1924d93901fbe4000cc018179115ef67cedea68d/calltree/?range=385.2865_432.0684~388.2881_389.0031&thread=1
Please search "glDeleteObjectARB_Exec" in profiler.
Deleting shaders shouldn't be that slow. Not sure what's going on in that code, specifically.
Flags: needinfo?(jgilbert) → needinfo?(jmuizelaar)
Jerry can you check how many shaders we're deleting? I was looking at Safari and it deletes shaders pretty often so it seems like there's no obviously slow about deletion.
Flags: needinfo?(jmuizelaar) → needinfo?(hshih)
Hi Ehsan, Could you please check the map size in comment 0? What's the total time for the all shaders deletion? Is it about 400ms? Could you show the test url here?  https://dxr.mozilla.org/mozilla-central/rev/d0462b0948e0b1147dcce615bddcc46379bdadb2/gfx/layers/opengl/CompositorOGL.cpp#190
Flags: needinfo?(hshih) → needinfo?(ehsan)
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #11) > Hi Ehsan, > Could you please check the map size in comment 0? I can't really go back in time to when I captured the profile and investigate how many shaders we had there. :-) > What's the total time > for the all shaders deletion? Is it about 400ms? Could you show the test url > here? Sounds like these are the types of questions that you want to answer by adding telemetry probes and gathering real world data on.
I haven't checked the shaders deletion time locally yet. The total shader object number is not fixed. The shader object is created on demand. And I'm afraid that it's related to device issue(apple intel hd5000 gl graphic driver in the profiler). If there is test url to test the shutdown time, I can try to capture the profiler again to see the difference.
I don't have a test URL unfortunately. I've seen this a few times in profiles after I have filed the bug, and I'll try to see if I have more info to share the next time I hit this. Still I think we should add telemetry here.
Jerry, what is the next step here? It's not clear to me if you're actively working on this? If not, we need to find an owner. Thank you.
(In reply to :Ehsan Akhgari from comment #14) > I don't have a test URL unfortunately. I've seen this a few times in > profiles after I have filed the bug, and I'll try to see if I have more info > to share the next time I hit this. Still I think we should add telemetry > here. Ehsan, could you share the about:support of your testing machine(with discrete or integrated GPU)? Then we can try to try some WebGL links to reproduce shader deletion in comment 7. If we still couldn't reproduce it, we can consider to log shader deletion into telemetry.
Flags: needinfo?(hshih) → needinfo?(ehsan)
Michael, Could you please try to log the gl shader number and shader deletion time with mbpr 2013(comment 7)? I see the call-stack with "Apple intel hd 5000 driver".
I opened couple of tabs for random websites and close the window. It deletes 11 shaders, their deletion time is too little for PR_IntervalNow() so I got all 0 here. Do you have some particular case which can reproduce the issue?
I used a late-2013 Macbook pro 15-inch but got no luck reproducing it. I think this issue is driver-related. Hi Ehsan, can you post your about:support here for a better diagnose of this bug?
FWIW, this bug is moving a bit off the rails, let me try to re-focus the effort. What matters the most here is for us to drive the cost of this for the majority of our users, not for me specifically. There is no reason at all to keep focusing on anything about my machine. We have some generic telemetry about these sync IPC issues. For example, this is the current data on Aurora: Start End IPC_SYNC_LATENCY_MS Count 0 1 0 (0%) 1 3 419.31k (76.44%) 3 8 69.31k (12.64%) 8 20 35.48k (6.47%) 20 50 14.07k (2.56%) 50 126 6.75k (1.23%) 126 317 2.45k (0.45%) 317 796 846 (0.15%) 796 2k 253 (0.05%) 2k Infinity 73 (0.01%) The telemetry that we collect ignores sync IPC issues that finish really quickly (currently on Aurora less than 1ms). The current data shows that a ton of these probes take some time between 1-3 ms, but a good number of them can jank us for several frames. Note that this can happen during closing a window (can this happen on any other code path also?) where we have several other expensive operations and the cost of everything on this path adds up making the overall performance of closing a window less than ideal. Since the work here happens on the UI thread and locks up the compositor as well, any significant amount of time we spend deleting shaders can have disastrous performance consequences.
The WillClose only happened when users tried to close the window. Based on the discussion from Jerry and nical, one purpose of sync IPC is to make sure the requests from child side are handled properly by parent. But there comes a question that which share resource we need this sync IPC approach to protect. http://searchfox.org/mozilla-central/source/gfx/layers/ipc/CrossProcessCompositorBridgeParent.h#54
Trying to understand this bug; the particular sync IPC being mentioned is Mac only. On the other hand, comment 21 talks about sync IPC in more general terms. Is this bug "fix all graphics related sync IPC on close on all platforms"?
I think this 'WillClose' sync IPC happens on all paltforms.
(In reply to Peter Chang[:pchang] from comment #24) > I think this 'WillClose' sync IPC happens on all platforms. Comment 7 shows the time spent in GL shader deletion on the Mac - do we have a profile for Windows? Do we know how long it takes on Windows?
Before comment 21, we thought this was caused by GL shader deletion based on ehsan's profile but we couldn't reproduce this in taipei with Mac or Windows machine. But looking the data in comment 21, it mentioned about the latency distribution that caused by this sync IPC on all platforms. Therefore, we might need to think about how to improve this sync IPC or change to async approach.
Assign to Jerry and he will start to work on this after 4/10.
Assignee: nobody → hshih
(In reply to Milan Sreckovic [:milan] from comment #23) > Trying to understand this bug; the particular sync IPC being mentioned is > Mac only. On the other hand, comment 21 talks about sync IPC in more > general terms. I don't see anything in the discussion this bug which seems Mac specific. I don't really know much about the code that is sending this IPC message, but since it's only called from one place it was easy to see where it's coming from. This code was added in bug 1215265 (see this code <http://searchfox.org/mozilla-central/rev/7419b368156a6efa24777b21b0e5706be89a9c2f/gfx/layers/ipc/CompositorBridgeChild.cpp#166>) and based on the discussion in that bug, this should affect all platforms. > Is this bug "fix all graphics related sync IPC on close on all platforms"? This is the only graphics sync IPC bug that happens when closing a window that I'm aware of currently.
Assignee: hshih → dvander
Ehsan, if we're only collecting how often these messages jank, it seems like we wouldn't know how often they *don't* jank. To find out if this really impacts users we'd have to correlate it with another metric, like how often a window is closed and then brings a background window into the foreground. This message was synchronous long before bug 1215265 - in fact that one removed a sync message. The original sync message was added in  which helpfully has no bug or commit message. I think in modern IPDL this one might not be so hard to remove, I'll do a try run and see. The tricky part is that before the GPU process, the compositor relied on that sync message scoping the lifetime of its resources. If the main thread destroys the window before an in-process compositor "forgets" it, we'll crash. But it should be possible to negotiate shutdown asynchronously.
No longer blocks: 1215265
This is a good thing to do on top of bug 1349699, to make sure we're not about to add more IPDL crashes.
This is proving to be not so easy. The X11 compositor flat out crashes if it races with a widget that is closing (assumedly, because X11 resources have gone away). Our choices are probably (1) have platform specific shutdown logic, or (2) make WillClose async, but refuse to destroy a composited widget until the compositor has sent back an ACK message. I would expect either one to cause problems with general browser shutdown, since layers expects to be cleaned up by xpcom-shutdown time, which would not be true anymore. I'd like to do some Telemetry first to make sure this is actually worth fixing - I think we already collect enough data to make a correlation.
Based on comment 16, I add a telemetry for measuring shader deletion time. Since the number of shader is not fixed, the data is classified according to the number. I've tried to measured it by opening some websites locally and the shader number was around 10 and the deletion time was 1 ms; therefore, I expect the data will mainly in first two categories but I still defined some categories cases with more shader number. Is this what we need for the telemetry?
(In reply to Kevin Chen[:kechen] (UTC + 8) from comment #33) > Created attachment 8856348 [details] [diff] [review] > Add telemetry for shader deletion. > > Based on comment 16, I add a telemetry for measuring shader deletion time. > Since the number of shader is not fixed, the data is classified according to > the number. > I've tried to measured it by opening some websites locally and the shader > number was around 10 and the deletion time was 1 ms; therefore, I expect the > data will mainly in first two categories but I still defined some categories > cases with more shader number. > > Is this what we need for the telemetry? I don't think it answers the question, "how important is fixing this bug?". The D3D11 compositor doesn't delete shaders until the last window is closed, and I'm not too worried about dropping synchronous messages on shutdown. What we really want is to know how often these messages are delaying a background window coming into the foreground.
(In reply to David Anderson [:dvander] from comment #29) > Ehsan, if we're only collecting how often these messages jank, it seems like > we wouldn't know how often they *don't* jank. To find out if this really > impacts users we'd have to correlate it with another metric, like how often > a window is closed and then brings a background window into the foreground. Yes, that's a good point. The telemetry here is just a crude generic probe that measures all long running sync IPC, but in this case I agree that it's not enough data. I think it makes sense to get a sense of how often this janks out of the number of times that janks against the sync IPC call happening and not janking (I think that's what the "window is closed and bring up a background window into the foreground" case roughly would be?) Please note that for the UX team, the performance of the window closing is critical, i.e., they want the window to disappear as soon as the user presses the X button. So this bug would be important to fix if it even slows down a small-ish number of cases significantly, or if it slows down a large number of cases but by a little bit only (but enough to make things perceivable added on top of other slowness). The hard question would be to turn this into some precise thresholds of course. (Not that this should block getting the telemetry!) > This message was synchronous long before bug 1215265 - in fact that one > removed a sync message. The original sync message was added in  which > helpfully has no bug or commit message. Ugh. :( (In reply to David Anderson [:dvander] from comment #32) > This is proving to be not so easy. The X11 compositor flat out crashes if it > races with a widget that is closing (assumedly, because X11 resources have > gone away). Our choices are probably (1) have platform specific shutdown > logic, or (2) make WillClose async, but refuse to destroy a composited > widget until the compositor has sent back an ACK message. I would expect > either one to cause problems with general browser shutdown, since layers > expects to be cleaned up by xpcom-shutdown time, which would not be true > anymore. Couldn't we use an async shutdown blocker for that? They're pretty neat, they allow you to force some cleanup C++ code to run before xpcom-shutdown, so we can wait for any remaining widgets to be destroyed before allowing the shutdown to proceed. See <http://searchfox.org/mozilla-central/rev/624d25b2980cf5f83452b040e6d664460f9b7ec5/dom/media/MediaManager.cpp#1807> for an example.
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #35) > Couldn't we use an async shutdown blocker for that? They're pretty neat, > they allow you to force some cleanup C++ code to run before xpcom-shutdown, > so we can wait for any remaining widgets to be destroyed before allowing the > shutdown to proceed. See > <http://searchfox.org/mozilla-central/rev/ > 624d25b2980cf5f83452b040e6d664460f9b7ec5/dom/media/MediaManager.cpp#1807> > for an example. That *is* neat, thanks! That might make this easier.
Hrm, I don't think an async shutdown blocker will help that much. If we can't destroy some widgets until the compositor shuts down, then we're still delaying the visual effect of the window closing. Perhaps longer than if we just synchronously blocked. When I query this in the telemetry backend, 96% of IPC_SYNC_MAIN_LATENCY_MS values for Msg_WillClose are < 32ms. That's not a terribly useful statistic, but it does suggest things aren't that bad. Very few sessions are only in the awful buckets. There's no real correlation with compositor type. However, there is a correlation with the GPU process: only 0.8% of sessions with any Msg_WillClose statistic have the GPU process enabled - versus 20% in the full population I queried. Does this suggest the main process is way too busy and the GPU process is alleviating the load? That wouldn't surprise me but it totally wasn't an expected outcome. Anyway, I'm doing some try runs to see if I can make WillClose a Linux/Mac thing and drop it for Windows. I think I sorted out the layers/xpcom-shutdown issue.
Actually I take that back, I didn't realize IPC_SYNC_MAIN_LATENCY_MS is such a new statistic. When I limit the population to that, there's no correlation. But oddly, only 0.7% of sessions with that histogram appearing have the GPU process enabled. That is very odd and suggests we accidentally turned it off or something.
Priority: P3 → --
(In reply to David Anderson [:dvander] from comment #38) > Actually I take that back, I didn't realize IPC_SYNC_MAIN_LATENCY_MS is such > a new statistic. This probe was rename from IPC_SYNC_LATENCY_MS in bug 1337073. Does that help make sense of the data?
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #39) > (In reply to David Anderson [:dvander] from comment #38) > > Actually I take that back, I didn't realize IPC_SYNC_MAIN_LATENCY_MS is such > > a new statistic. > > This probe was rename from IPC_SYNC_LATENCY_MS in bug 1337073. Does that > help make sense of the data? Nope... I'm legitimately confused. I re-ran my telemetry job and IPC_SYNC_MAIN_LATENCY_MS in the parent process is highly correlated with not having the GPU process. Users with the GPU process are extremely unlikely to have high-latency sync messages. (And users with high-latency sync messages are extremely unlikely to have the GPU process.)
That is - in the UI process. I didn't do any correlation for sync messages from content, I'm assuming there isn't one but I will check.
(In reply to David Anderson [:dvander] from comment #34) > (In reply to Kevin Chen[:kechen] (UTC + 8) from comment #33) > > > > Is this what we need for the telemetry? > > I don't think it answers the question, "how important is fixing this bug?". > The D3D11 compositor doesn't delete shaders until the last window is closed, > and I'm not too worried about dropping synchronous messages on shutdown. > What we really want is to know how often these messages are delaying a > background window coming into the foreground. I still think we should fix the long shader deletion time problem. The ~compositor() is running at compositor thread. If there is a ~compositor() call spending 500ms, all rendering task will be blocked for 500ms. Add the data into telemetry could give us the impact factor for ~compositor().
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #42) > (In reply to David Anderson [:dvander] from comment #34) > > (In reply to Kevin Chen[:kechen] (UTC + 8) from comment #33) > > > > > > Is this what we need for the telemetry? > > > > I don't think it answers the question, "how important is fixing this bug?". > > The D3D11 compositor doesn't delete shaders until the last window is closed, > > and I'm not too worried about dropping synchronous messages on shutdown. > > What we really want is to know how often these messages are delaying a > > background window coming into the foreground. > > I still think we should fix the long shader deletion time problem. The > ~compositor() is running at compositor thread. If there is a ~compositor() > call spending 500ms, all rendering task will be blocked for 500ms. Add the > data into telemetry could give us the impact factor for ~compositor(). Can you file a separate bug for fixing that?
After a lot of unsuccessful try runs, this is still looking difficult to fix. Given we're focusing on Windows 8+ and E10S, and the data in comment #40 suggests the GPU process is a big win for UI sync messages, it should be enough to try turning the GPU process on for more users. We're aiming to do that in bug 1356448, so after that has had time to settle, I'll re-check the stats.
Whiteboard: [qf:p1][gfx-noted] → [qf:p2][gfx-noted]
(In reply to David Anderson [:dvander] from comment #44) > it should be > enough to try turning the GPU process on for more users. We're aiming to do > that in bug 1356448, so after that has had time to settle, I'll re-check the > stats. (Looks like bug 1356448 hit Nightlies 9 days ago. Not sure if that's enough "time to settle" :), but tagging you for needinfo to follow up on this once you get a chance. Thanks!)
Priority: -- → P3
I filed bug 1370690 to re-do that analysis.
Assignee: dvander → nobody
The analysis seems to indicate this is not a significant issue anymore. I think this bug can stay open as making the message async altogether, but it's not a priority anymore.
Summary: PCompositorBridge::Msg_WillClose can lock up the UI thread for a long time when closing a window → PCompositorBridge::Msg_WillClose can block the UI thread when closing a window
Whiteboard: [qf:p2][gfx-noted] → [qf:p3][gfx-noted]
Whiteboard: [qf:p3][gfx-noted] → [qf:p3][gfx-noted][fxperf]
You need to log in before you can comment on or make changes to this bug.