Closed Bug 1217185 Opened 9 years ago Closed 9 years ago

e10s video performance is very bad

Categories

(Core :: Audio/Video: Playback, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla45
Tracking Status
e10s ? ---
firefox43 --- affected
firefox44 --- affected
firefox45 --- fixed

People

(Reporter: cpeterson, Assigned: bobowen)

References

()

Details

Attachments

(1 file)

Anthony's video performance test shows that MP4 frame drops are about 20x higher when e10s is enabled. Curiously, webm frame drops are "only" 5–10x worse with e10s.

http://people.mozilla.org/~ajones/frame-drop-test/
					Total	Drops	Drops
Format	Level	Resolution	Rate	frames	e10s	non-e10s
mp4	4.2	1920x1080	64.0	320	145	  0
webm		2560x1920	30.7	154	 51	  5
mp4	5.0	2560x1920	30.7	154	 54	  0
webm		3840x2160	30.7	159	 92	 82
mp4	5.1	3840x2160	30.7	159	 16	  1
webm		3840x2160	66.8	335	280	278
mp4	5.2	3840x2160	66.8	151	 92	  4
MP4 is hardware decoded so we're passing a queue of 3 frames over the IPC.
WebM is software decoded so we're passing a queue of 10 frames over IPC.

Brad - could the IPC be getting delayed? Does it go through the main thread?
Flags: needinfo?(blassey.bugs)
Flags: needinfo?(blassey.bugs) → needinfo?(wmccloskey)
Hi Anthony, what type of IPC do you mean? Some of it goes over the main thread and some doesn't. I imagine the ImageBridge is involved here. It's off the main thread. We have had some issues there because there's no rate-limiting (bug 1126078).

One issue might be that, as far as I know, we don't set thread priorities on our background threads. But that shouldn't be an issue on an otherwise idle system I wouldn't think.
Flags: needinfo?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #3)
> One issue might be that, as far as I know, we don't set thread priorities on
> our background threads. But that shouldn't be an issue on an otherwise idle
> system I wouldn't think.

We're talking about a system with a loaded up GPU but the CPU not doing too much. We have a queue of 3 frames which gives us a window of 32ms. That means a round trip delay approaching 16ms would probably be enough for us to fall behind.
Flags: needinfo?(ajones)
IPC roundtrip times are typically around 50 microseconds on an unloaded system.

Can you explain what code is involved here? In comment 2 you said:
> MP4 is hardware decoded so we're passing a queue of 3 frames over the IPC.
Where does that happen?
Flags: needinfo?(ajones)
(In reply to Bill McCloskey (:billm) from comment #5)
> Where does that happen?

Robert - can you answer that question?
Flags: needinfo?(roc)
I don't understand the question. If you're asking where does the IPC happen, it's in ImageBridgeChild::EndTransaction which calls PImageBridgeChild::SendUpdate or PImageBridgeChild::SetUpdateNoSwap.

This bug needs someone to dig in and figure out what's going on rather than guessing. Assuming this e10s discrepancy reproduces on Linux under rr, rr would make this a lot easier.
Flags: needinfo?(roc)
Anthony, can you get someone to do that digging?
Anthony says he will ask mwoodrow to look at this.
Matt - you didn't object strongly enough when I mentioned this... are you volunteering?
Flags: needinfo?(matt.woodrow)
I can do some investigation, yeah. Leaving ni? so that I don't lose this.
I can't reproduce this on either of my machines (OSX and Windows) unfortunately.

I'll be in the Auckland office the week after next (before Orlando), so I'll have a look on Anthony's machine while I'm there.
The e10s sandbox is blocking hardware accelerated video from working, so we're falling back to decoding in software.

Setting security.sandbox.content.level=0 lets it work again, and the results are much better.

Sandbox logging shows:

Process Sandbox BLOCKED: NtOpenKeyEx for : SOFTWARE

Brad, any idea who we can get to fix this?
Flags: needinfo?(matt.woodrow) → needinfo?(blassey.bugs)
Bob, thoughts?
Flags: needinfo?(blassey.bugs) → needinfo?(bobowen.code)
(In reply to Matt Woodrow (:mattwoodrow) from comment #13)
> The e10s sandbox is blocking hardware accelerated video from working, so
> we're falling back to decoding in software.
> 
> Setting security.sandbox.content.level=0 lets it work again, and the results
> are much better.
> 
> Sandbox logging shows:
> 
> Process Sandbox BLOCKED: NtOpenKeyEx for : SOFTWARE
> 
> Brad, any idea who we can get to fix this?

Can't reproduce on my workstation with Win7.

I can reproduce on my laptop on Win10 and it goes away when security.sandbox.content.level=1, so it looks like it must be something to do with using USER_INTERACTIVE or JOB_INTERACTIVE in the policy, I suspect the latter.

I'll investigate further.

Not sure if there might be something Windows version specific or my workstation is just fast enough to decode in software.
Matt - where is the actual code that attempts to use accelerated video and then falls back to software (or is there an easier way to tell if it is using software)?
Flags: needinfo?(bobowen.code) → needinfo?(matt.woodrow)
If you set layer.draw-borders=true, then you should see a red rectangle around video when it's hardware accelerated, and orange otherwise.

The initialization code is here: http://mxr.mozilla.org/mozilla-central/source/dom/media/platforms/wmf/WMFVideoMFTManager.cpp#156
Flags: needinfo?(matt.woodrow)
(In reply to Matt Woodrow (:mattwoodrow) from comment #16)
> If you set layer.draw-borders=true, then you should see a red rectangle
> around video when it's hardware accelerated, and orange otherwise.
> 
> The initialization code is here:
> http://mxr.mozilla.org/mozilla-central/source/dom/media/platforms/wmf/
> WMFVideoMFTManager.cpp#156

This was perfect, thanks Matt.

It is indeed the JOB_INTERACTIVE on the policy that is causing the problem, using any JOB_OBJECT_UILIMIT_* restrictions on the job triggered the fall back to software.
It was falling back on Windows 7 on my workstation, but it was fast enough to cope.
I also found this comment [1] in the policy for Chromium's gpu process, so I thought that we might have to have a proxy window as that shouldn't cause the same deadlock issues for us (without the separate gpu process).

However, when I looked for the actual failing command I found it was at [2].
Using GetShellWindow here seems a bit odd, it returns the "Program Manager" window which is owned by the Windows Explorer process.
Apparently this is what chromium did, but they recently landed a change [3], which uses null for both HWNDs.

I've tried this and it appears to work, even though the docs suggest one of these should not be null.
Maybe it doesn't matter for the video decoding, we do create and set a focus window for the layers acceleration.

[1] https://chromium.googlesource.com/chromium/src/+/e3de838f5daa700732b05f037e5c08acdec2ffdb/content/browser/gpu/gpu_process_host.cc#216
[2] https://dxr.mozilla.org/mozilla-central/rev/5ba77225c957f335ee7771a781c5faeb316c06fd/dom/media/platforms/wmf/DXVA2Manager.cpp#295
[3] https://codereview.chromium.org/1421553014/
Do we need to drop both calls to ::GetShellWindow()? Or just the one within CreateDeviceEx?

From MSDN "For windowed mode, this parameter may be NULL only if the hDeviceWindow member of pPresentationParameters is set to a valid, non-NULL value."

We are specifying ::GetShellWindow() for hDeviceWindow (on line 290), so it seems valid to remove the one in CreateDeviceEx.
(In reply to Matt Woodrow (:mattwoodrow) from comment #19)
> Do we need to drop both calls to ::GetShellWindow()? Or just the one within
> CreateDeviceEx?
> 
> From MSDN "For windowed mode, this parameter may be NULL only if the
> hDeviceWindow member of pPresentationParameters is set to a valid, non-NULL
> value."
> 
> We are specifying ::GetShellWindow() for hDeviceWindow (on line 290), so it
> seems valid to remove the one in CreateDeviceEx.

As we briefly discussed, the chromium change specifies null for both of these, even though as you point out the documentation suggests that is invalid.

This is what I've done in the try push from comment 18.
Looks like b-m and W-e10s2 might be broken by that, so I'll need to look into those.

If we do have to pass some sort of HWND, it will have to be something other than the one returned by GetShellWindow.
Looks like you were right about the tests Matt.

Here's a push without my patch and it has the same failures:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd23fc03593b
Assignee: nobody → bobowen.code
Status: NEW → ASSIGNED
Attachment #8697297 - Flags: review?(matt.woodrow)
Attachment #8697297 - Flags: review?(matt.woodrow) → review+
https://hg.mozilla.org/mozilla-central/rev/f24cb2c2ddf0
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
Just realised that I forgot to clear this when I landed the patch.
Flags: needinfo?(ajones)
Depends on: 1236112
Depends on: 1322283
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: