1355416 - Long composition times with Google blog on Windows coming from GetDeviceRemovedReason

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Description

•

8 years ago

See bug 1342209 comment 11 for more details and the background. We use GetDeviceRemovedReason in a number of places, and there are indications in the above comment that there may be scenarios where this could take a long time. We should investigate and find out the details of the problem, and if we have an alternative.

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 1

•

8 years ago

Flow should evaluate this for priority.

Assignee: nobody → howareyou322

Whiteboard: [gfx-noted][qf]

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 3

•

8 years ago

(In reply to Milan Sreckovic [:milan] from comment #1) > Flow should evaluate this for priority. Spending 6ms during composition in a function that is only there to deal with rare error conditions is excessive. Most of the time that's equivalent to sleep(), so we want to at least understand what's going on and hopefully fix it.

Peter Chang[:pchang]

Comment 4

•

8 years ago

Assign Jerry to investigate.

Assignee: howareyou322 → hshih

Jerry Shih[:jerry] (UTC+8) (inactive)

Comment 5

•

8 years ago

In https://bugzilla.mozilla.org/show_bug.cgi?id=1342209#c11 > This is the profile with e10s enabled. > https://perfht.ml/2oYHd99 > I found several composition took more than 10 ms and they were spent 60% composition time for > GetDeviceRemovedReason inside > CompositorD3D11::EndFrame(). (In reply to Milan Sreckovic [:milan] from comment #3) > (In reply to Milan Sreckovic [:milan] from comment #1) > > Flow should evaluate this for priority. > > Spending 6ms during composition in a function that is only there to deal > with rare error conditions is excessive. Most of the time that's equivalent > to sleep(), so we want to at least understand what's going on and hopefully > fix it. The profile data show "the total time" of GetDeviceRemovedReason() is 10ms, but I think it contains several GetDeviceRemovedReason() calls. Not just a 10ms GetDeviceRemovedReason() call. This show a 3ms call: https://perf-html.io/public/38208b64d0ea1f87ae6cf1b5e0224d694c574176/calltree/?range=12.3444_14.0369~13.1493_13.5532~13.2697_13.2849~13.2779_13.2829&thread=4 And this is the 2ms call: https://perf-html.io/public/38208b64d0ea1f87ae6cf1b5e0224d694c574176/calltree/?range=12.3444_14.0369~13.1493_13.5532~13.2697_13.2849~13.2719_13.2750&thread=4 Anyway, we check the device status at BeginFrame() and EndFrame(). We could probably skip the check in EndFrame(). We check the status to skip the whole frame painting in BeginFrame() when the device is abnormal. And it looks like we only release the readLock() inside the EndFrame(). It's not related to the d3d device.

Status: NEW → ASSIGNED

(no longer active)

Comment 6

•

8 years ago

Looking at the profiles in comment 5, it does seem like the GetDeviceRemovedReason() call can push the compositor towards taking more than a frame budget per composition which sounds pretty bad, so I'm marking this as [qf:p1] based on that (not quite sure how commonly this call can occur though...)

(no longer active)

Updated

•

8 years ago

Whiteboard: [gfx-noted][qf] → [gfx-noted][qf:p1]

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 7

•

8 years ago

(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #6) > Looking at the profiles in comment 5, it does seem like the > GetDeviceRemovedReason() call can push the compositor towards taking more > than a frame budget per composition which sounds pretty bad, so I'm marking > this as [qf:p1] based on that (not quite sure how commonly this call can > occur though...) Agreed - we call this twice per frame. Even if "only" 2-3ms, it's essentially a "sleep". We may consider the 'nuclear' option where, if with GPU process, we skip those checks completely, let the GPU process crash on device reset, and just switch to software at that point. At least there would be no penalty when there are no device resets. It's probably worth putting that kind of approach behind a preference and doing a telemetry experiment.

Kan-Ru Chen [:kanru] (UTC+9)

Comment 8

•

8 years ago

As a data point, the BHR from 20170405 has 111 stacks like this: % grep GetDeviceRemovedReason hangs.json|wc -l 111 "CDevice::GetDeviceRemovedReason() (in d3d11.pdb)", "mozilla::gfx::DeviceManagerDx::GetAnyDeviceRemovedReason(DeviceResetReason *) (in xul.pdb)", "mozilla::gfx::DeviceManagerDx::HasDeviceReset(DeviceResetReason *) (in xul.pdb)", "gfxWindowsPlatform::SchedulePaintIfDeviceReset() (in xul.pdb)", "nsRefreshDriver::Tick(__int64,mozilla::TimeStamp) (in xul.pdb)", "mozilla::RefreshDriverTimer::TickDriver(nsRefreshDriver *,__int64,mozilla::TimeStamp) (in xul.pdb)", "mozilla::RefreshDriverTimer::TickRefreshDrivers(__int64,mozilla::TimeStamp,nsTArray<RefPtr<nsRefreshDriver> > &) (in xul.pdb)", "mozilla::RefreshDriverTimer::Tick(__int64,mozilla::TimeStamp) (in xul.pdb)", "mozilla::VsyncRefreshDriverTimer::RunRefreshDrivers(mozilla::TimeStamp) (in xul.pdb)", "mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::TickRefreshDriver(mozilla::TimeStamp) (in xul.pdb)", "mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::NotifyVsync(mozilla::TimeStamp) (in xul.pdb)", "mozilla::layout::VsyncChild::RecvNotify(mozilla::TimeStamp const &) (in xul.pdb)", "mozilla::layout::PVsyncChild::OnMessageReceived(IPC::Message const &) (in xul.pdb)", "mozilla::ipc::PBackgroundChild::OnMessageReceived(IPC::Message const &) (in xul.pdb)", "mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const &) (in xul.pdb)", "mozilla::ipc::MessageChannel::DispatchMessageW(IPC::Message &&) (in xul.pdb)", "mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::MessageChannel::MessageTask &) (in xul.pdb)", "mozilla::ipc::MessageChannel::MessageTask::Run() (in xul.pdb)", "nsThread::ProcessNextEvent(bool,bool *) (in xul.pdb)", "NS_ProcessNextEvent(nsIThread *,bool) (in xul.pdb)", "mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *) (in xul.pdb)", "mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate *) (in xul.pdb)", "MessageLoop::RunHandler() (in xul.pdb)", "MessageLoop::Run() (in xul.pdb)", "nsBaseAppShell::Run() (in xul.pdb)" So this may be blocking main thread too.

Kan-Ru Chen [:kanru] (UTC+9)

Comment 9

•

8 years ago

(In reply to Kan-Ru Chen [:kanru] (UTC-4, EDT) from comment #8) > So this may be blocking main thread too. The above stack means this function showed up in background hang reports as being the currently running function after 8 seconds of main thread hanging.

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 10

•

8 years ago

I would be okay just dropping all the DeviceRemoved calls to be honest. If it crashes we can add them back and re-think our strategy. There are so many places we *don't* call it it's not clear why randomly calling it once per frame would be reliable. It'd make more sense to only call it if some other error occurred. And even with the GPU process, this still a content crash, so we can't really limit it based on that.

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Comment 11

•

8 years ago

It's really the checks in BeginFrame/EndFrame (https://hg.mozilla.org/mozilla-central/rev/34fd4c7639ec). The rest are only made if a DX call fails. In the BeginFrame case, we could probably move the DeviceRemoved check if UpdateRenderTarget call fails further down in the function. And remove it from EndFrame completely.

Kevin Chen[:kechen] (UTC + 8)

Updated

•

8 years ago

Depends on: 1356537

Bas Schouten (:bas.schouten)

Assignee

Comment 12

•

8 years ago

It's -very- important here to realize that all this call is doing is increase CPU usage... the compositor thread could not be doing anything else during this time. You can change the Sleep in this function to be a Sleep(1) (something I've considered as a valid step, I have Markus an executable to test on a variety of the Toronto QA lab systems to see if Sleep(1) returns within 2ms consistently as it does on my machine) and take away the CPU usage, since it would call GetDeviceRemovedReason a lot less, but it would -still- need to loop there and block the compositor, because we can't get more than 1 frame ahead of the GPU for numerous reasons (and there's no point, since we wouldn't be displaying anything we do anyway). i.e. this is not caused by a single GetDeviceRemovedReason taking a long time, this is just that in that loop, waiting for the compositor to finish, that call takes the vastly longer amount of time. So to be clear: Mitigating the GetDeviceRemovedReason cost in EndFrame in particular, has no user-facing performance benefit other than improved battery life and lower CPU usage. I'm going to make sure you have both seen this as I missed this bug.

Flags: needinfo?(kchen)

Flags: needinfo?(dvander)

Bas Schouten (:bas.schouten)

Assignee

Comment 13

•

8 years ago

This should be re-triaged since the slow composition here doesn't come from GetDeviceRemovedReason usage, it comes from the GPU being busy doing 'something' else.

Whiteboard: [gfx-noted][qf:p1] → [gfx-noted][qf]

Bas Schouten (:bas.schouten)

Assignee

Comment 14

•

8 years ago

(In reply to David Anderson [:dvander] from comment #10) > I would be okay just dropping all the DeviceRemoved calls to be honest. If > it crashes we can add them back and re-think our strategy. > > There are so many places we *don't* call it it's not clear why randomly > calling it once per frame would be reliable. It'd make more sense to only > call it if some other error occurred. And even with the GPU process, this > still a content crash, so we can't really limit it based on that. The reason we call it in EndFrame is because the Query may not ever return in rare cases of a device removal, and as such we would block the compositor thread forever.

Bas Schouten (:bas.schouten)

Assignee

Comment 15

•

8 years ago

(In reply to Jerry Shih[:jerry] (UTC+8) from comment #5) > In https://bugzilla.mozilla.org/show_bug.cgi?id=1342209#c11 > > This is the profile with e10s enabled. > > https://perfht.ml/2oYHd99 > > I found several composition took more than 10 ms and they were spent 60% composition time for > > GetDeviceRemovedReason inside > > CompositorD3D11::EndFrame(). > > (In reply to Milan Sreckovic [:milan] from comment #3) > > (In reply to Milan Sreckovic [:milan] from comment #1) > > > Flow should evaluate this for priority. > > > > Spending 6ms during composition in a function that is only there to deal > > with rare error conditions is excessive. Most of the time that's equivalent > > to sleep(), so we want to at least understand what's going on and hopefully > > fix it. > > The profile data show "the total time" of GetDeviceRemovedReason() is 10ms, > but I think it contains several GetDeviceRemovedReason() calls. Not just a > 10ms GetDeviceRemovedReason() call. > > This show a 3ms call: > https://perf-html.io/public/38208b64d0ea1f87ae6cf1b5e0224d694c574176/ > calltree/?range=12.3444_14.0369~13.1493_13.5532~13.2697_13.2849~13.2779_13. > 2829&thread=4 > > And this is the 2ms call: > https://perf-html.io/public/38208b64d0ea1f87ae6cf1b5e0224d694c574176/ > calltree/?range=12.3444_14.0369~13.1493_13.5532~13.2697_13.2849~13.2719_13. > 2750&thread=4 > > Anyway, we check the device status at BeginFrame() and EndFrame(). We could > probably skip the check in EndFrame(). We check the status to skip the whole > frame painting in BeginFrame() when the device is abnormal. And it looks > like we only release the readLock() inside the EndFrame(). It's not related > to the d3d device. It should be noted even these are highly unlikely to show 3ms or 2ms calls (although I suppose it's possible but when measuring the actual time of the function on my machine, even when the GPU is busy, it gets nowhere near that time), but much more likely, these are just 3 subsequent samples which all happen to sample that function, because it's such a large part of the loop. More importantly though, when we hit this codepath, the GPU is already a full frame behind on work, so even if we take a couple of ms on the compositor thread, since issueing the next frame is fast, in most situations this probably won't have a user-facing effect. Understanding what the GPU is doing to get this far behind is useful though, but a more difficult and very different type of investigation.

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 16

•

8 years ago

(In reply to Bas Schouten (:bas.schouten) from comment #14) > (In reply to David Anderson [:dvander] from comment #10) > > I would be okay just dropping all the DeviceRemoved calls to be honest. If > > it crashes we can add them back and re-think our strategy. > > > > There are so many places we *don't* call it it's not clear why randomly > > calling it once per frame would be reliable. It'd make more sense to only > > call it if some other error occurred. And even with the GPU process, this > > still a content crash, so we can't really limit it based on that. > > The reason we call it in EndFrame is because the Query may not ever return > in rare cases of a device removal, and as such we would block the compositor > thread forever. What do you mean by that? Don't we have a timeout on compositor queries?

Flags: needinfo?(dvander)

Bas Schouten (:bas.schouten)

Assignee

Comment 17

•

8 years ago

(In reply to David Anderson [:dvander] from comment #16) > (In reply to Bas Schouten (:bas.schouten) from comment #14) > > (In reply to David Anderson [:dvander] from comment #10) > > > I would be okay just dropping all the DeviceRemoved calls to be honest. If > > > it crashes we can add them back and re-think our strategy. > > > > > > There are so many places we *don't* call it it's not clear why randomly > > > calling it once per frame would be reliable. It'd make more sense to only > > > call it if some other error occurred. And even with the GPU process, this > > > still a content crash, so we can't really limit it based on that. > > > > The reason we call it in EndFrame is because the Query may not ever return > > in rare cases of a device removal, and as such we would block the compositor > > thread forever. > > What do you mean by that? Don't we have a timeout on compositor queries? We do indeed. I suppose this particular one could be removed and we could live with the 2 second timeout rather than recovering right away. Anyway, it doesn't solve any issue to remove it, you'd just be spending all time in the actual Query itself and the profile would change but nothing else would :-).

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 18

•

8 years ago

One argument to removing the GetDeviceRemovedReason call could be that it gives us another an extra ms or two to process a layers update before the next composite. Out of curiosity, why does this call even block on the GPU to begin with? Does DirectX not just store this in a variable and update it whenever the reset occurs?

Vincent Liu[:vliu]

Updated

•

8 years ago

Comment 19

•

8 years ago

Yes, as David said it's not about reducing CPU usage but to remove the time blocked in GetDeviceRemovedReason whether it's a sleep or not. I saw this in the BHR report and it's on main thread. I'm not sure if it's real problem so sure we can retriage.

Flags: needinfo?(kchen)

Naveed Ihsanullah [:naveed]

Comment 20

•

8 years ago

How do we determine if this impacts users?

Flags: needinfo?(bas)

Bas Schouten (:bas.schouten)

Assignee

Comment 21

•

8 years ago

(In reply to David Anderson [:dvander] from comment #18) > One argument to removing the GetDeviceRemovedReason call could be that it > gives us another an extra ms or two to process a layers update before the > next composite. > > Out of curiosity, why does this call even block on the GPU to begin with? > Does DirectX not just store this in a variable and update it whenever the > reset occurs? We block on the GPU here because the GPU is more than a full frame behind on drawing, and if we'd get more than a full frame ahead of the GPU we'd get in trouble with double buffering (and in any case the user isn't -seeing- anything new anyway, so issueing more frames is sort of pointless at this point). (In reply to Naveed Ihsanullah [:naveed] from comment #20) > How do we determine if this impacts users? This itself does not impact users. -However- it is indicative of some other issue that -is- visible to users. For some reason, and the reason itself is very difficult to determine, it could just be that some other process on the machine, or Firefox itself, is using all the memory bandwidth, it could be we're trying to composite way too many pixels (this seems unlikely but it's possible). A more advanced method of profiling would be required to determine what's going on.

Flags: needinfo?(bas)

Bas Schouten (:bas.schouten)

Assignee

Comment 22

•

8 years ago

I will try a little patch here.

Assignee: hshih → bas

Mike Conley (:mconley) (:⚙️)

Updated

•

8 years ago

Updated

•

8 years ago

Comment 23

•

8 years ago

I ran some tests, I will only be able to fix this matter on Windows 8+, the tick-based kernel of Windows 7 means that I can't sleep for less than 10ms, which is not a risk I'm willing to take for this release. (It might be fine, since we're more than a frame, i.e. 16.67ms behind, however with 10ms we will potentially severely delay the shipping of the next frame off to the GPU).

Bas Schouten (:bas.schouten)

Assignee

Updated

•

8 years ago

Whiteboard: [gfx-noted][qf] → [gfx-noted][qf:p3]

Milan Sreckovic [:milan] (needinfo for best results)

Reporter

Updated

•

7 years ago

Priority: -- → P3

Firefox Bug Husbandry Bot

Updated

•

7 years ago

Keywords: perf

Bas Schouten (:bas.schouten)

Assignee

Updated

•

4 years ago

Status: ASSIGNED → RESOLVED

Closed: 4 years ago

Resolution: --- → WONTFIX

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

3 years ago

Performance Impact: --- → P3

Whiteboard: [gfx-noted][qf:p3] → [gfx-noted]