Closed Bug 1119947 Opened 9 years ago Closed 9 years ago

Add telemetry for video playback

Categories

(Core :: Audio/Video, defect, P1)

x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla38
Tracking Status
firefox37 + fixed
firefox38 --- fixed

People

(Reporter: ajones, Assigned: cajbir)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 5 obsolete files)

The things we need for telemetry for video playback are:

* Time spent buffering and whether the buffering is at the start, after seeking or during playback.
* Proportion of frames dropped and incidences of multiple frames being dropped
* Occurrances of skipping to the next keyframe
Benjamin - can you find someone to help with adding telemetry? I can point to the pieces of code that have the information we need.
Flags: needinfo?(benjamin)
Can you articulate the questions we're trying to answer in the end? That will help design the right set of histograms or other measurements.

IIRC for the buffering metrics there were two different questions to answer. One is a performance question, and the other is detecting bugs that cause us to buffer forever. In order to answer the question "how many videos got stuck buffering" we'd need to know how many videos got started, no? Or maybe something even more nuanced such as how much video-time was spent at all.

For "proportion of frames dropped" we'd need to know the total number of frames and separately the number of drops, right?

Does "skipping to the next keyframe" mean something the user does while seeking, or something the decoder itself does? Is that caused by not keeping up and having to skip frames to play smoothly?
Flags: needinfo?(benjamin) → needinfo?(ajones)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #2)
> Can you articulate the questions we're trying to answer in the end? That
> will help design the right set of histograms or other measurements.
<
> IIRC for the buffering metrics there were two different questions to answer.
> One is a performance question, and the other is detecting bugs that cause us
> to buffer forever. In order to answer the question "how many videos got
> stuck buffering" we'd need to know how many videos got started, no? Or maybe
> something even more nuanced such as how much video-time was spent at all.

Primarily we want to know how many people get stuck buffering forever. This is the essential metric. It is a smoking gun. How many videos got played or how many minutes of videos got played give us more context.

Another issue we are concerned about is rebuffering once the video starts. This is affected by the decisons that the side Javascript makes. The main thing we can tweak is the logic of when we resume playback.

> For "proportion of frames dropped" we'd need to know the total number of
> frames and separately the number of drops, right?

Yes. We already collect these metrics during playback and in YouTube they're shown on stats for nerds.

> Does "skipping to the next keyframe" mean something the user does while
> seeking, or something the decoder itself does? Is that caused by not keeping
> up and having to skip frames to play smoothly?

Skipping to the next keyframe is something we do when decoding can't keep up with playback. We want to know if we have our logic right because in the past we've seen cases where disabling it makes things better. Knowing something about the kinds of machines that see this problem will also help us a lot.
Flags: needinfo?(ajones)
Assignee: nobody → benjamin
Priority: -- → P1
I poked at this briefly this morning. I believe that the most important buffering question is best answered with a single boolean histogram which measures the state of a MSE video element when it is unloaded.

VIDEO_MSE_BUFFERING_AT_UNLOAD:boolean

If the element was buffering when the page unloads, we'll record true, otherwise false. This should give us a metric of all the MSE video loads (either true or false, plus a ratio of how many were buffering.

* Is it sufficiently accurate to do this from HTMLMediaElement::UnbindFromTree?
* If HTMLMediaElement::mMediaSource is non-null, does that mean this is an MSE element?
* Do I want to be checking mReadyState/mNetworkState or something else?
Flags: needinfo?(ajones)
Flags: needinfo?(cajbir.bugzilla)
* Yes, UnbindFromTree should be sufficiently accurate.
* Yes, mMediaSource being non-null means that this is an MSE element
* mReadyState == HAVE_CURRENT_DATA seems to be what should be checked, according to :cpearce.

Needinfo me if you need more info. Thanks!
Flags: needinfo?(cajbir.bugzilla)
Flags: needinfo?(ajones)
Attached patch 1119947-mse-telemetry.patch (obsolete) — Splinter Review
I'm not sure this works usefully. I don't know enough about MSE to construct a testcase where we actually do buffer indefinitely. I did verify that this records some values when watching youtube videos, and that the value changes when I go offline while loading a video. 

I'd appreciate not just review on this, but some dedicated QA with artificial testcases to make sure this can measure what you need. Marcia, is this something you can help with?
Flags: needinfo?(mozillamarcia.knous)
Attachment #8550508 - Flags: review?(cpearce)
Oh, I forgot: here's the try run for this, so you can do QA: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1ef1e8294c1a
Comment on attachment 8550508 [details] [diff] [review]
1119947-mse-telemetry.patch

Review of attachment 8550508 [details] [diff] [review]:
-----------------------------------------------------------------

I think cajbir is a better reviewer for media telemetry stuff.
Attachment #8550508 - Flags: review?(cpearce) → review?(cajbir.bugzilla)
Attachment #8550508 - Flags: review?(cajbir.bugzilla) → review+
I don't yet know enough about this to provide test cases. Are you looking for a video that we can script with the MSE API? Are you looking for videos that have problems now, or had them in the past?
Flags: needinfo?(benjamin)
There are two parts: in abstract, I'd like to know that this records the correct things for an arbitrary MSE video in the following cases:

* the video finishes playing and then we unload the page
* the video is playing while we unload the page
* the video gets stuck buffering forever and we unload the page

Also in specific, I'd like to make sure that this can produce useful data *on youtube.com*, since that is the prime target that we care about in Firefox 36.
Flags: needinfo?(benjamin)
This will be discussed with the team at today's standup for next steps.
Flags: needinfo?(mozillamarcia.knous)
Chris - can you help Marcia figure out how to check if we have the right information?
Flags: needinfo?(cpearce)
I think relying on (mReadyState == HAVE_CURRENT_DATA) alone as I previously suggested to indicate playback has stalled due to buffering could be unreliable, as if the video element reached end of playback, or if the CPU was too slow we could we'd get a false positive.

I think the most reliable way to determine that the user bailed out due to buffering in UnbindFromTree() would be to get the buffered ranges there, and iterate over them to find the range containing the currentTime, and if the currentTime is at the end of a range and we're not ended.

The "at the end of a range" test needs to be fuzzy, as the ranges and currentTime are floating point.
bsmedberg: I'll write the HTMLMediaElement.cpp parts of that patch if you like.
Flags: needinfo?(cpearce)
I think we need to report telemetry in the suspend branch of HTMLMediaElement::SuspendOrResumeElement(), as that triggers immediately on tab close and navigation, whereas UnbindFromTree() only triggers on GC, which could be some time after the tab close/navigation, and the SuspendOrResumeElement() that happens on tab close/navigation will pause the media, contaminating the stats we're trying to collect.
Do we also want to track whether the page closes while seeking? That would track whether the user has given up because the seek failed.
Attached patch WIP patch 2 (obsolete) — Splinter Review
Here's my WIP patch with the hopefully more accurate buffered detection. This needs to be tested. I need to get bug 1123535, so I probably won't get back to this for the next day or so.
Assignee: benjamin → cajbir.bugzilla
This adds telemetry for when the user exits video playback during the following times:
    
1. The video has ended
2. The video is in the middle of a seek
3. The video is buffering
4. If the video is paused by the user, but not due to seeking or ended.
    
More telemetry will be added in bug 1127646.

I'm not really sure we want the telemetry on if the video is paused by the user but I left that in from the original patch. I don't report the 'paused' if we are in the middle of the seek or ended as that's reported separately anyway.

Review request for :cpearce for the media stuff and :bsmedberg for the telemetry stuff.

I've tested this and it logs to the nsMediaElement logger correctly when exiting the video at the right times.
Attachment #8550508 - Attachment is obsolete: true
Attachment #8556104 - Attachment is obsolete: true
Attachment #8563892 - Flags: review?(cpearce)
Attachment #8563892 - Flags: review?(benjamin)
Status: NEW → ASSIGNED
Comment on attachment 8563892 [details] [diff] [review]
Add telemetry for MSE video actions

Review of attachment 8563892 [details] [diff] [review]:
-----------------------------------------------------------------

Almost, but we need to either fix dormant mode to not interfere with this, or write another patch to make us immune to the damage wrought by us transitioning through dormant state before we shutdown.

::: dom/html/HTMLMediaElement.cpp
@@ +2554,5 @@
> +  // 1. If the video has ended
> +  // 2. If the video is in the middle of a seek
> +  // 3. If the video is buffering
> +  // 4. If the video is paused by the user, but not due to seeking or ended.
> +  bool seeking = Seeking();

I tested on YouTube on Windows. The fact that on Windows we go dormant before shutting down (which is something we should fix IMO) is messing things up here. When I close the tab, we transition through dormant mode (thanks to calling mDecoder->NotifyOwnerActivityChanged() inside HTMLMediaElement::NotifyOwnerDocumentActivityChanged(), which runs before SuspendOrResumeElement() where we report the telementry, and going dormant changed the state, messing up the reporting.

If the video is paused and I close the tab, |seeking| here is reported as true and |paused| is reported as false. That's the opposite of what's the intended here? Even if the video is playing, |seeking| is true.

If I set media.decoder.heuristic.dormant.enabled=false, everything works as I'd expect.

So before we can land this, someone will need to prevent us entering dormant mode before shutting down, when we know we're about to shutdown.

@@ +2575,5 @@
> +    const double errorMargin = 0.05;
> +    double t = CurrentTime();
> +    TimeRanges::index_type index = ranges->Find(t, errorMargin);
> +    ErrorResult ignore;
> +    auto s = ranges->Start(index, ignore);

I keep getting told off by bz for using auto for non nightmarish-template-hell types. So I feel compelled to pay it forward here and ask you to fully specify the simple type here.

@@ +2587,5 @@
> +  LOG(PR_LOG_DEBUG, ("%p VIDEO_MSE_STALLED_AT_UNLOAD = %d", this, stalled));
> +  Telemetry::Accumulate(Telemetry::VIDEO_MSE_STALLED_AT_UNLOAD, stalled);
> +
> +  LOG(PR_LOG_DEBUG, ("%p VIDEO_MSE_SEEKING_AT_UNLOAD = %d", this, seeking));
> +  Telemetry::Accumulate(Telemetry::VIDEO_MSE_SEEKING_AT_UNLOAD, seeking);

Style nit: report seeking up near where you declare the `seeking` variable, so that variables are declared close to where they're used.
Attachment #8563892 - Flags: review?(cpearce) → review-
Address review comments. In HTMLMediaElement::NotifyOwnerDocumentActivityChanged I check if we are being destroyed and if so don't call the decoder to do the dormant code. I tested this on win64 to ensure it prevents the seek being true when closing the tab.

:bsmedberg for the telemetry code.
Attachment #8563892 - Attachment is obsolete: true
Attachment #8563892 - Flags: review?(benjamin)
Attachment #8564856 - Flags: review?(cpearce)
Attachment #8564856 - Flags: review?(benjamin)
Comment on attachment 8564856 [details] [diff] [review]
Fix with review comments addressed

Review of attachment 8564856 [details] [diff] [review]:
-----------------------------------------------------------------

Almost... There's still a problem caused by dormant mode here...

If the video element is dormant (i.e. has been non-playing and non-visible for 60+ seconds) and we close the tab, the stats are incorrectly reported; the video is reported as seeking and non-paused at unload.

STR:
1. Set media.decoder.heuristic.dormant.timeout=0 to make this faster to reproduce
2. Open video. Pause video once it starts playing.
3. Change tabs so that video is in background
4. Observe "Change machine state from DECODING to DORMANT" in NSPR MediaDecoder logs.
5. Close the video's tab.

Observe in NSPR logs:
VIDEO_MSE_SEEKING_AT_UNLOAD = 1
VIDEO_MSE_PAUSED_AT_UNLOAD = 0

I would expect the opposite.

When a non-visible media pauses or ends, or when a paused/ended media is made non-visible, we start a timer which will make the video dormant.

Here's the backtrace when the timer runs:

xul.dll!mozilla::MediaDecoderStateMachine::SetDormant(bool aDormant) Line 1501	C++
xul.dll!mozilla::MediaDecoder::UpdateDormantState(bool aDormantTimeout, bool aActivity) Line 189	C++
xul.dll!mozilla::MediaDecoder::DormantTimerExpired(nsITimer * aTimer, void * aClosure) Line 209	C++
xul.dll!nsTimerImpl::Fire() Line 631	C++
xul.dll!nsTimerEvent::Run() Line 729	C++

Also, when media.decoder.heuristic.dormant.timeout=0, we'll still enter dormant on the way to shutdown. That's an edge case in practise, as the user would have to get lucky with the timer firing exactly at the time when the user closes the tab, but we should handle it I think.
Attachment #8564856 - Flags: review?(cpearce) → review-
Comment on attachment 8564856 [details] [diff] [review]
Fix with review comments addressed

What is the reason for having a bunch of separate histograms instead of one enumerated? Is that because a video can be in multiple states (stalled and seeking?) and we want to be able to record that separately?

If that's the case, this is ok, although if you can really only be in one state at unload, I'd prefer a single enumerated histogram.
Attachment #8564856 - Flags: review?(benjamin) → review+
Tracking all MSE P1 bugs for Firefox 37.
* Fix dormancy issue where a dormant video was always reported as seeking.
* Fix case where UpdateDormantState was called from timer
* Change telemetry to use an enumeration instead of separate values.

r+ carried forward from :bsmedberg.
Attachment #8564856 - Attachment is obsolete: true
Attachment #8566190 - Flags: review?(cpearce)
Comment on attachment 8566190 [details] [diff] [review]
Fix with review comments addressed

Review of attachment 8566190 [details] [diff] [review]:
-----------------------------------------------------------------

Tested, and it works here. I was unable to hit a stalled condition in my testing, but the other conditions behave as expected.

::: dom/html/HTMLMediaElement.cpp
@@ +2586,5 @@
> +    double t = CurrentTime();
> +    TimeRanges::index_type index = ranges->Find(t, errorMargin);
> +    ErrorResult ignore;
> +    double s = ranges->Start(index, ignore);
> +    double e = ranges->End(index, ignore);

double s and e are unused here now.
Attachment #8566190 - Flags: review?(cpearce) → review+
Carrying forward r+ from cpearce and bsmedberg. Removes unused variables that cpearce noticed and an additional one picked up by try build failure.
Attachment #8566190 - Attachment is obsolete: true
Attachment #8566753 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/fb653123d555
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
This is an MSE P1 that I assume we want for Beta 1. Can you please request uplift before Monday?
Flags: needinfo?(giles)
Attached patch Backport to 37Splinter Review
Rebase around bug 1091140 for landing on 37. Cajbir, can you double-check this please?
Flags: needinfo?(giles)
Attachment #8567358 - Flags: feedback?(cajbir.bugzilla)
Comment on attachment 8567358 [details] [diff] [review]
Backport to 37

Approval Request Comment
[Feature/regressing bug #]: MSE
[User impact if declined]: Less effective testing.
[Describe test coverage new/current, TreeHerder]: Landed on m-c.
[Risks and why]: This adds telemetry reporting of media decoder state at teardown to help diagnose issues. It requires a new method in HTMLMediaElement, but is straightforward. I'm not worried about the risk, and it's better to get it in sooner than later.
[String/UUID change made/needed]: None.
Attachment #8567358 - Flags: approval-mozilla-aurora?
Attachment #8567358 - Flags: feedback?(cajbir.bugzilla) → feedback+
Comment on attachment 8567358 [details] [diff] [review]
Backport to 37

I agree about the need to land this sooner rather than later as we need this to get some objective measures for MSE. Aurora+
Attachment #8567358 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: