Closed Bug 1230641 Opened 4 years ago Closed 4 years ago

playback of mp4 in MacOS Firefox Nightly is bad

Categories

(Core :: Audio/Video: Playback, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla51
Tracking Status
firefox51 --- fixed

People

(Reporter: dougt, Assigned: mattwoodrow, NeedInfo)

References

Details

Attachments

(7 files, 1 obsolete file)

https://streamable.com/g5dv
    https://cdn.streamable.com/video/mp4/g5dv.mp4


Play back of this mp4 is pretty bad in Firefox.  If you open it up in VLC or QuickTime, playback is fine.
What version of FF were you using?

IIRC, you mentioned on IRC that you used a 2015 MBPr 13" right?
We do skip lots of frames on that video (this is FF developer edition 44) nearly half of them

1478 frames, 631 decoded but only 304 painted...

Using FFmpeg h264 decoder it plays nicely.
Interestingly, on iPhone it shows some pauses, always in the same spot @4s
Video plays fine with a local central build... Can't think of any changes that would explain the behaviour on 44.
Assignee: nobody → jyavenard
This issue can be easily seen with the video set in full screen.
This is a 60fps video.

I added logs to find out the time it took for the decoder to decode a frame and how often MediaFormatReader::RequestVideoData is called

What it shows is that 99% of the time, the video MediaDataPromise is typically resolved within 1ms. There are times however where the Apple VT PDM will take > 30ms to return a frame, this always happen at around 6.95ms with over 10 frames being queued for decoding.

What is more worrying however, is that the time between calls to RequestVideoData is almost always > 16ms ; which is the bare minimum to handle 60fps.
Looking at when the MDSM is rescheduled for another run shows that we hit this:
http://mxr.mozilla.org/mozilla-central/source/dom/media/MediaDecoderStateMachine.cpp#2513
constntly

I see mPlaybackRate being one, so we reschedule the MDSM every AUDIO_DURATION_USECS which is every 40ms.

This appears to be a regression from https://hg.mozilla.org/mozilla-central/rev/1dc6e120ebff05eb39ed64d6cbedbd72c0ab5bc9
Flags: needinfo?(kikuo)
Flags: needinfo?(jwwang)
I have a suspicion on why the mac VT decoder will sometimes takes over 30ms to decode a frame, will investigate that theory
(In reply to Jean-Yves Avenard [:jya] from comment #4)
> This issue can be easily seen with the video set in full screen.
> This is a 60fps video.
> 
> I added logs to find out the time it took for the decoder to decode a frame
> and how often MediaFormatReader::RequestVideoData is called
> 
> What it shows is that 99% of the time, the video MediaDataPromise is
> typically resolved within 1ms. There are times however where the Apple VT
> PDM will take > 30ms to return a frame, this always happen at around 6.95ms
> with over 10 frames being queued for decoding.
> 
> What is more worrying however, is that the time between calls to
> RequestVideoData is almost always > 16ms ; which is the bare minimum to
> handle 60fps.
> Looking at when the MDSM is rescheduled for another run shows that we hit
> this:
> http://mxr.mozilla.org/mozilla-central/source/dom/media/
> MediaDecoderStateMachine.cpp#2513
> constntly
> 
> I see mPlaybackRate being one, so we reschedule the MDSM every
> AUDIO_DURATION_USECS which is every 40ms.
> 
> This appears to be a regression from
> https://hg.mozilla.org/mozilla-central/rev/
> 1dc6e120ebff05eb39ed64d6cbedbd72c0ab5bc9

VideoSink keeps updating rendered frames here [1]. It's related to the first timestamp(greater than current) in queue. Just like original logic it was.
[1] https://hg.mozilla.org/mozilla-central/rev/1dc6e120ebff05eb39ed64d6cbedbd72c0ab5bc9#l4.375
As you said, the number of painted frames is much lower than decoded frames, this should be checked in VideoSink and underlying machinery. 


RequestVideoData() mostly is triggered when there's a decoded video data pushed into VideoQueue [2].
[2] http://mxr.mozilla.org/mozilla-central/source/dom/media/MediaDecoderStateMachine.cpp#705
That should not be much related to the (40ms) scheduling in MDSM which attempts to update playback position periodically. The problem here to me is the decoder taking too long for decode I think.
Sorry I don't have a mac environment for further testing.
Flags: needinfo?(kikuo)
It looks like bug 1218311 is not uplifted to 44. Does it happen to 45?
Flags: needinfo?(jwwang)
(In reply to Kilik Kuo [:kikuo][:kilikkuo] from comment #6)
> http://mxr.mozilla.org/mozilla-central/source/dom/media/
> MediaDecoderStateMachine.cpp#705
> That should not be much related to the (40ms) scheduling in MDSM which
> attempts to update playback position periodically. The problem here to me is
> the decoder taking too long for decode I think.
> Sorry I don't have a mac environment for further testing.

Then why the very long delay between calls to RequestVideoData?
Even when the video promise is resolved in less than 1ms, the next call to RequestVideoData will typically be > 30ms after the previous one.

We are mostly slow at decoding because RequestVideoData isn't called quickly enough
(In reply to JW Wang [:jwwang] from comment #7)
> It looks like bug 1218311 is not uplifted to 44. Does it happen to 45?

yes, but it's not as bad as in 44. 44 will typically skips more frames, to the point where it's unwatchable
(In reply to Jean-Yves Avenard [:jya] from comment #8)
> (In reply to Kilik Kuo [:kikuo][:kilikkuo] from comment #6)
> > http://mxr.mozilla.org/mozilla-central/source/dom/media/
> > MediaDecoderStateMachine.cpp#705
> > That should not be much related to the (40ms) scheduling in MDSM which
> > attempts to update playback position periodically. The problem here to me is
> > the decoder taking too long for decode I think.
> > Sorry I don't have a mac environment for further testing.
> 
> Then why the very long delay between calls to RequestVideoData?
> Even when the video promise is resolved in less than 1ms, the next call to
> RequestVideoData will typically be > 30ms after the previous one.
> 
> We are mostly slow at decoding because RequestVideoData isn't called quickly
> enough

If VideoDataPromise can be resolved in less than 1ms, so that a OnVideoDecoded(or OnVideoNotDecoded) should be called to receive newly decoded data. Then a call to MDSM::Push, then to MDSM::DispatchDecodeTasksIfNeeded, then MDSM::EnsureVideoDecodeTaskQueued, and then a MDSM::RequestVideoData in final to trigger another round of video requesting.
Unless there's no need to decode video in certain cases, one of the cases is that there're enough video data. But this should still doesn't make impact on videosink rendering frames.
https://www.mozilla.org/en-US/firefox/44.0a2/auroranotes/

It looks like 44.0a2 is released on November 3 which doesn't include bug 1218311.
You think it's bug 1218311 that caused the regression?

I'm not entirely convinced it's a regression though, I've seen it in 44 too, albeit less severe
bug 1218311 is about porting the fix of bug 1193614 to VidoeSink.
Is this still an issue?
Priority: -- → P1
yes... though I had temporarily given up on it...

When timing the apple's VT decoder itself, almost every frames took > 16ms to decode (and this a 60fps video). Which is super weird as FFmpeg has no problem with it.

Yet, Apple using the same framework plays that video just fine.
Assignee: jyavenard → nobody
I have an iMac and running on the latest OS, i have opened the provided link with the provided link, no problems has been found with the video feedback, tested it out on quickplayer and is running the same.

https://cdn.streamable.com/video/mp4/g5dv.mp4
which imac was that though?

the issue can be seen on my MBP 2015 13" retina with an Intel Iris Graphics 6100.

it happens consistently there. unfortunately, still no clue on why this would be happening on this laptop and why the intel decoder takes so long for some of the frames
Assignee: nobody → jyavenard
I can reproduce a similar issue using VLC when I enable Apple's VDA hardware decoder.
Over 30% of frames are dropped, though the frames are dropped more smoothly than we do so the video is looks better with VLC.

If I disable the hardware decoder, playback is still bad. However, disabling both the hardware decoder and the use of GPU surface to instead use software decoder: then it works okay.

Looking at the CPU usage and the process running when playing this video with Safari, I can't see the VTDecoderXPCService process kicking in; so it appears Safari isn't using the hardware decoder either.

So maybe what we should do is simply blacklist on mac the hardware decoder for certain resolution and framerate like we do on windows.
When using the VT software decoder with software YUV buffer, playback plays almost smoothly, with very few frames being dropped (and skip to next keyframe logic never kicks in).

When using IOSurface however the result is very bad. Seeing that we are always using IOSurface during decoding, this indicates that the decoder would be waiting for an IOSurface to be made available after use (used by the compositor)
Also: with the default setting, the average time taken by the decoder to output a frame is 26404us.
If after starting playback, I immediately tab out, this time drops to 14ms.

The VideoToolbox framework uses a CVPixelBufferPool to allocate frames.

VT provides an API to control the minimum number of frames available in this pool:

	@constant	kVTDecompressionPropertyKey_OutputPoolRequestedMinimumBufferCount
	@abstract
		Requests that the VTDecompressionSession use the value provided as a minimum buffer
		count for its output CVPixelBufferPool, not releasing buffers while the number in
		use is below this level.
	@discussion
		This property effectively requests that the kCVPixelBufferPoolMinimumBufferCountKey key
		be used for the creation of the output CVPixelBufferPool.
		
		For general playback cases, standard CVPixelBufferPool age-out behaviour should be 
		sufficient and this property should not be needed.  This property should only be used in 
		unusual playback scenarios where a peak pool level is known, and the potential 
		memory overhead is an acceptable tradeoff for avoiding possible buffer reallocation.
		
		Setting this property to NULL or passing in the value 0 will clear this setting and
		remove the minimum buffer count.
		
		Setting this property while a VTDecompressionSession is in use will result in the 
		creation of a new CVPixelBufferPool. This will cause new buffers to be allocated, and 
		existing buffers to be deallocated when they are released.
So still progressing here.

I've timed various things.
The video attached is made of 1422 samples.
If I set the decode ahead value to 1422 (so there is never any wait between the time we feed the decoder with new data). It shows that the average time to decode a frame is 12ms, and the median time is 8ms.
This is surprisingly low because the dummy decoder I wrote using identical decoding code takes an average of 8ms with a median time of 4ms.
So it's much slower, but I can't explain why.

Now I then timed the following:
1- Time between a call to RequestVideoData and the time the video promise is resolved
2- Time between when a video promise is resolved and the next RequestVideoData
3- Time between two call to MDM::OnVideoPopped

With a decode ahead settings of 10.
1)
min	52.83
max	424460.96 <- this is the time between the first frame decoding and the next, and can be ignored
average	3023.913268
median	269.83

So this works quite well, while it may take a while for a video to be decoded from time to time, the majority of the time, it's short and as decode ahead works as expected, it explain the low average.

2)
min	136.53
max	7050229.88
average	18543.18263
median	13203.22

now this is bad. 18.5ms average between requests makes it impossible to play a 60fps video properly.

3)
min	20
max	91748.69
average	6338.207178
median	2930.62

this is allright, it shows how often the video queue pop a sample.
Now, it's puzzling on why time between OnVideoPopped be 6ms, yet the time between RequestVideoData is so high. Logs shows that MDSM::NeedToDecodeVideo() typically returns false. So no video decoding is queued.
JW, what is the logic on how often RequestVideoData will be called, and what could explain why such delays are seen?

thanks
Flags: needinfo?(jwwang)
ok, found out what's going on.

Because it's HW decoding, a new request will only be made if the VideoQueue size is < 3.

Additionally, VideoQueue().GetSize() shows at 3 even as OnVideoPopped got called. Sounds like a race between the time the nsdequeue is popped and the time the MediaEventProducer listener is called.
Flags: needinfo?(jwwang)
OnVideoPopped() is called asynchronously when video samples are popped from the queue. Therefore it is possible the queue is filled again by the time OnVideoPopped() is called. However, it should be fine since DispatchVideoDecodeTaskIfNeeded() always check if the queue is full before really dispatching decoding tasks.

What race are you concerned about?
(In reply to JW Wang [:jwwang] from comment #23)
> What race are you concerned about?

it turned out to be something else.

Most of the time the VideoSink pop a video frame, it immediately push it back
https://dxr.mozilla.org/mozilla-central/source/dom/media/mediasink/VideoSink.cpp#399

so the task that is being dispatched serves no purpose whatsoever.

More often than not however, logs show that the reschedule of the media sink is set to be 39ms (basically to coincide when the next frame is about to be popped from the video queue).

That means that the VideoSink calls the MDSM that it needs a new frame right when it's already too late to get one (as the video decoder takes sometimes > 16ms).

The time the media sink is being put on hold is wasted, when it could have scheduled the MDSM to get a new frame earlier instead.

Everything works well if the decoder is very fast and can decode a frame in less than the time it took to display the current one (e.g. < 16ms)

However, in this particular video, while the average time to decode is about 12ms, the variance is great (e.g it goes like 2ms, 2ms, 3ms, 38ms, 29ms, 2ms)
As when HW accelerated the queue is only 3 frames deep, being slow to decode a single frame is enough to kick the skip to the next key frame logic.

If things were done more regularly, decoding in advance, and more importantly, the MDSM requesting the frames more often (requesting frames every 18ms is never going to allow playback of a 60fps video) we would have a chance to play.

right now, it fails miserably.

I'm afraid however that this bug only reveals a fundamental issue with our entire gecko architecture, we have way too many threads going, so task dispatches are slow (I've seen up to 18ms between the time a task is queued on the reader's task queue and the time it actually runs)
(In reply to Jean-Yves Avenard [:jya] from comment #24)
> More often than not however, logs show that the reschedule of the media sink
> is set to be 39ms (basically to coincide when the next frame is about to be
> popped from the video queue).

This shouldn't be true. The video sink is scheduled to wake up at the start time of the next frame. E.g. if the current time is 100ms, and the start time of next video frame is 110ms, video sink will wake up 10ms later to render the next frame.
(In reply to JW Wang [:jwwang] from comment #25)
> This shouldn't be true. The video sink is scheduled to wake up at the start
> time of the next frame. E.g. if the current time is 100ms, and the start
> time of next video frame is 110ms, video sink will wake up 10ms later to
> render the next frame.

not quite, it is scheduled to wake up after the start of the next frame whose time is past the current time
      if (nextFrame->mTime > clockTime) {
        remainingTime = nextFrame->mTime - clockTime;
        break;
      }

so if frames are dropped because they are too late, the delay will be greater than a frame duration.

I added logs in VideoSink doing:
  TimeDuration next =
    TimeDuration::FromMicroseconds(remainingTime / mAudioSink->GetPlaybackParams().mPlaybackRate);
  TimeStamp target = nowTime + next;
  fprintf(stderr, "rescheduling in:%.02f queuesize:%u\n", next.ToSeconds() * 1000000, (uint32_t)VideoQueue().GetSize());

and in the logs, I see plenty of burst with:
rescheduling in:39885.00 queuesize:11
popping image at front
pushing image at front
rescheduling in:39885.00 queuesize:11
OnVideoPopped, time since last popped: 42965.02
UpdateplaybackPositionPeriodically ScheduleStateMachineIn: 40000
popping image at front
pushing image at front
rescheduling in:39885.00 queuesize:11
OnVideoPopped, time since last popped: 36108.73

on this run, the average time between calls to RequestVideoData is 18543.18263us
If frames are dropped, they don't even exist to the video sink. And the video sink still wake up at the start time of "next" frame, right?
why wouldn't they exist in the videosink ? 
https://dxr.mozilla.org/mozilla-central/source/dom/media/mediasink/VideoSink.cpp#385

loops over all frames in the video queue until it finds the first whose start time is > clock, all others are dropped.
(In reply to JW Wang [:jwwang] from comment #27)
> If frames are dropped, they don't even exist to the video sink. And the
> video sink still wake up at the start time of "next" frame, right?

yes, it will wake up at the start time of the next frame, but it does nothing until then. So OnVideoPopped isn't called, and as such RequestVideoData isn't called.

While we obviously have issues on this video on this particular machine speed wise. Decoding in advance rather than waiting until it's too late to decode would allow to play this video.
(In reply to Jean-Yves Avenard [:jya] from comment #29)
> yes, it will wake up at the start time of the next frame, but it does
> nothing until then. So OnVideoPopped isn't called, and as such
> RequestVideoData isn't called.

https://dxr.mozilla.org/mozilla-central/source/dom/media/MediaDecoderStateMachine.cpp#704
MDSM will also request next samples when a sample is pushed to the queue. In fact, MDSM always tries to decode as much as possible to keep the queues full without intervention from video sink.
As I wrote, the average time between the moment the video data promise is resolved and the next RequestVideoData is 18ms, so the MDSM does a poor job at keeping the queue full in time :(

https://docs.google.com/spreadsheets/d/1gtApsRPRRFhwy1zto3tnAM7ed5SexAFoRJOCsSIli04/edit?usp=sharing

on tab 2 you see the various times I mentioned in comment 20
Now of course, it could just be the MDSM task queue is too busy processing other tasks, or the overall system is too busy to action anything
See Also: → 1195767
this is a 34s profiler run when playing the video.

The biggest chunk of time is spent in 

The compositor thread takes 17% of CPU time

Still the total amount of CPU time is 26.7s, so it's not particularly huge seeing that this machine is a dual core.

compositor biggest task is:
Running Time	Self (ms)		Symbol Name
3042.0ms   11.3%	1.0	 	                    mozilla::layers::LayerManagerComposite::UpdateAndRender()

main thread biggest tasks are:
Running Time	Self (ms)		Symbol Name
2835.0ms   10.5%	 nsRefreshDriver::Tick(long long, mozilla::TimeStamp)
2541.0ms    9.4%	 mozilla::dom::HTMLMediaElement::nsAsyncEventRunner::Run()
JW pointed out on how bad performance was when using full screen. And indeed, decoding time slows to a crawl when playing in full screen.

Matt, anything out of the ordinary coming out here? CPU usage is almost the same as when in windowed mode.
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(jmuizelaar)
The mac decoder when used on intel GPU gives bad performance due to locking. Increasing the queue size allows to alleviate most problems.

Review commit: https://reviewboard.mozilla.org/r/39923/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/39923/
Attachment #8730481 - Flags: review?(cpearce)
Attachment #8730481 - Flags: review?(cpearce) → review+
Comment on attachment 8730481 [details]
MozReview Request: Bug 1230641: P2. Increase the video queue size on mac. r?cpearce

https://reviewboard.mozilla.org/r/39923/#review36483
Comment on attachment 8730480 [details]
MozReview Request: Bug 1230641: P1. Use UYVY (YUV422) format in decoders. r?mattwoodrow

https://reviewboard.mozilla.org/r/39921/#review36501
Attachment #8730480 - Flags: review?(matt.woodrow) → review+
(In reply to Jean-Yves Avenard [:jya] from comment #36)
> Created attachment 8730480 [details]
> MozReview Request: Bug 1230641: P1. Use UYVY (YUV422) format in decoders.
> r?mattwoodrow
> 
> Review commit: https://reviewboard.mozilla.org/r/39921/diff/#index_header
> See other reviews: https://reviewboard.mozilla.org/r/39921/

How does changing format improve playback?
> How does changing format improve playback?

when using YUV420P, we end up having two CVPixelBufferPool, one used by the hardware decoder, and one used to convert the data and pass it to the client (the AppleVTDecoder).

By using YUV422, the CVPixelBufferPool is shared internally halving the memory usage of the decoder and shelving some times as the conversion from the format returned by the decoder (UYVY) doesn't need to be converted to NV12.

Additionally (and more importantly). UYVY surface can be uploaded to an OGL surface and handled as if it was RGB. So the compositor no longer need a shader to convert UYVY into RGB and composite things (see bug 1248323)

Those two patches do not fix the main issue, they just help.
Depends on: 1248323
Keywords: leave-open
Flags: needinfo?(matt.woodrow)
Attached patch Timing log module AppleTiming (obsolete) — Splinter Review
MozReview-Commit-ID: JPKhjyycpJG
MozReview-Commit-ID: JPKhjyycpJG
Attachment #8762675 - Attachment is obsolete: true
With the help of :mstange, I created this project https://github.com/jyavenard/DecodeTest

It decodes and displays this video in a minimal application. It's very rough, being on PTO, my aim was only to get something quick

Decoding is slightly different to how it would occur within gecko, in that the demuxing+feeding of the decoder is performed on the main thread.

With the provided shader, the majority of the time is spent in [mContext flushBuffer]; in the function drawme.

I can do a more similar threading model as gecko if you think that could help.
Experimentation based on top of 1bab81ac53da1b755fc83384c7f12b26f5554752.

I played around with this for a while, adding a proper frame queue and trying to copy more exactly what gecko does.

I found that when NSOpenGLCPSurfaceOpacity was set to 0 (what gecko does unconditionally) and when the NSTitledWindowMask window mask flag was removed (what gecko does during fullscreen) then performance was awful (20fps roughly).

mstange noticed the connection between these two and suggested trying [window setOpaque:YES], which unfortunately doesn't help.

The slowdown appears to be within [NSOpenGLContext flushBuffer] for a double buffered GL context, and within glFlush for a single buffered one. The time is spent waiting on mach_msg_trap.

It's not really clear why having a transparent context affects performance so much, or why removing the window title makes it show up.

I believe we have had somewhat poor performance even in windowed mode in gecko, so removing the window title may just exacerbate the opacity issue rather than actually cause it.
Assignee: jyavenard → matt.woodrow
Unfortunately we can't have an opaque context at all times since we need transparency to render rounded border on our window.

The poor video playback seems to be primarily when in fullscreen, so we can switch to using an opaque context during fullscreen and get much improved performance.

I'm not sure if there's much we can do for windowed without giving up on rounded corners. We could try asking Apple if there's a solution for this?

It would be interesting to see if other performance is improved with opaque contexts, it might be significant enough to accept the loss of rounded corners.
Attachment #8776827 - Flags: review?(mstange)
Comment on attachment 8776827 [details] [diff] [review]
Make the GLContext opaque when we can

Review of attachment 8776827 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. It's possible that starting with 10.10 we can make the context opaque even for regular windows, because the window server will apply the rounding. We'll need to experiment with that again.

pcwalton told me that servo always uses an opaque context, and they get the rounding by using a private API to describe a window shape region to the window server, but that doesn't give them anti-aliasing in the rounded corner. Maybe we can do better.

::: widget/cocoa/nsChildView.mm
@@ +2017,5 @@
>    CGFloat cornerRadius = [(ChildView*)mView cornerRadius];
>    mDevPixelCornerRadius = cornerRadius * BackingScaleFactor();
>    mIsCoveringTitlebar = [(ChildView*)mView isCoveringTitlebar];
>    NSInteger styleMask = [[mView window] styleMask];
> +  bool isFullscreen = (styleMask & NSFullScreenWindowMask) || !(styleMask & NSTitledWindowMask);

"!(styleMask & NSTitledWindowMask)"? That works, I guess, but only until we start accelerating popup windows. And it's existing code. I'll have to keep this in mind for when we actually want to start accelerating popup windows.
Attachment #8776827 - Flags: review?(mstange) → review+
Pushed by mwoodrow@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bfc47d8a87ef
Make our NSOpenGLContext opaque when in fullscreen mode. r=mstange
Depends on: 1292288
I think we should back this out due to bug 1292288.
Status: NEW → RESOLVED
Closed: 4 years ago
Keywords: leave-open
Resolution: --- → FIXED
Depends on: 1295217
Depends on: 1297879
Target Milestone: --- → mozilla51
You need to log in before you can comment on or make changes to this bug.