Closed Bug 1119947 Opened 8 years ago Closed 8 years ago
Add telemetry for video playback
The things we need for telemetry for video playback are: * Time spent buffering and whether the buffering is at the start, after seeking or during playback. * Proportion of frames dropped and incidences of multiple frames being dropped * Occurrances of skipping to the next keyframe
Benjamin - can you find someone to help with adding telemetry? I can point to the pieces of code that have the information we need.
Can you articulate the questions we're trying to answer in the end? That will help design the right set of histograms or other measurements. IIRC for the buffering metrics there were two different questions to answer. One is a performance question, and the other is detecting bugs that cause us to buffer forever. In order to answer the question "how many videos got stuck buffering" we'd need to know how many videos got started, no? Or maybe something even more nuanced such as how much video-time was spent at all. For "proportion of frames dropped" we'd need to know the total number of frames and separately the number of drops, right? Does "skipping to the next keyframe" mean something the user does while seeking, or something the decoder itself does? Is that caused by not keeping up and having to skip frames to play smoothly?
Flags: needinfo?(benjamin) → needinfo?(ajones)
Assignee: nobody → benjamin
Priority: -- → P1
I poked at this briefly this morning. I believe that the most important buffering question is best answered with a single boolean histogram which measures the state of a MSE video element when it is unloaded. VIDEO_MSE_BUFFERING_AT_UNLOAD:boolean If the element was buffering when the page unloads, we'll record true, otherwise false. This should give us a metric of all the MSE video loads (either true or false, plus a ratio of how many were buffering. * Is it sufficiently accurate to do this from HTMLMediaElement::UnbindFromTree? * If HTMLMediaElement::mMediaSource is non-null, does that mean this is an MSE element? * Do I want to be checking mReadyState/mNetworkState or something else?
* Yes, UnbindFromTree should be sufficiently accurate. * Yes, mMediaSource being non-null means that this is an MSE element * mReadyState == HAVE_CURRENT_DATA seems to be what should be checked, according to :cpearce. Needinfo me if you need more info. Thanks!
I'm not sure this works usefully. I don't know enough about MSE to construct a testcase where we actually do buffer indefinitely. I did verify that this records some values when watching youtube videos, and that the value changes when I go offline while loading a video. I'd appreciate not just review on this, but some dedicated QA with artificial testcases to make sure this can measure what you need. Marcia, is this something you can help with?
Attachment #8550508 - Flags: review?(cpearce)
Oh, I forgot: here's the try run for this, so you can do QA: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1ef1e8294c1a
Comment on attachment 8550508 [details] [diff] [review] 1119947-mse-telemetry.patch Review of attachment 8550508 [details] [diff] [review]: ----------------------------------------------------------------- I think cajbir is a better reviewer for media telemetry stuff.
Attachment #8550508 - Flags: review?(cpearce) → review?(cajbir.bugzilla)
Attachment #8550508 - Flags: review?(cajbir.bugzilla) → review+
I don't yet know enough about this to provide test cases. Are you looking for a video that we can script with the MSE API? Are you looking for videos that have problems now, or had them in the past?
There are two parts: in abstract, I'd like to know that this records the correct things for an arbitrary MSE video in the following cases: * the video finishes playing and then we unload the page * the video is playing while we unload the page * the video gets stuck buffering forever and we unload the page Also in specific, I'd like to make sure that this can produce useful data *on youtube.com*, since that is the prime target that we care about in Firefox 36.
This will be discussed with the team at today's standup for next steps.
Chris - can you help Marcia figure out how to check if we have the right information?
I think relying on (mReadyState == HAVE_CURRENT_DATA) alone as I previously suggested to indicate playback has stalled due to buffering could be unreliable, as if the video element reached end of playback, or if the CPU was too slow we could we'd get a false positive. I think the most reliable way to determine that the user bailed out due to buffering in UnbindFromTree() would be to get the buffered ranges there, and iterate over them to find the range containing the currentTime, and if the currentTime is at the end of a range and we're not ended. The "at the end of a range" test needs to be fuzzy, as the ranges and currentTime are floating point.
bsmedberg: I'll write the HTMLMediaElement.cpp parts of that patch if you like.
I think we need to report telemetry in the suspend branch of HTMLMediaElement::SuspendOrResumeElement(), as that triggers immediately on tab close and navigation, whereas UnbindFromTree() only triggers on GC, which could be some time after the tab close/navigation, and the SuspendOrResumeElement() that happens on tab close/navigation will pause the media, contaminating the stats we're trying to collect.
Do we also want to track whether the page closes while seeking? That would track whether the user has given up because the seek failed.
Here's my WIP patch with the hopefully more accurate buffered detection. This needs to be tested. I need to get bug 1123535, so I probably won't get back to this for the next day or so.
"I need to get bug 1123535 landed"...
Assignee: benjamin → cajbir.bugzilla
This adds telemetry for when the user exits video playback during the following times: 1. The video has ended 2. The video is in the middle of a seek 3. The video is buffering 4. If the video is paused by the user, but not due to seeking or ended. More telemetry will be added in bug 1127646. I'm not really sure we want the telemetry on if the video is paused by the user but I left that in from the original patch. I don't report the 'paused' if we are in the middle of the seek or ended as that's reported separately anyway. Review request for :cpearce for the media stuff and :bsmedberg for the telemetry stuff. I've tested this and it logs to the nsMediaElement logger correctly when exiting the video at the right times.
Comment on attachment 8563892 [details] [diff] [review] Add telemetry for MSE video actions Review of attachment 8563892 [details] [diff] [review]: ----------------------------------------------------------------- Almost, but we need to either fix dormant mode to not interfere with this, or write another patch to make us immune to the damage wrought by us transitioning through dormant state before we shutdown. ::: dom/html/HTMLMediaElement.cpp @@ +2554,5 @@ > + // 1. If the video has ended > + // 2. If the video is in the middle of a seek > + // 3. If the video is buffering > + // 4. If the video is paused by the user, but not due to seeking or ended. > + bool seeking = Seeking(); I tested on YouTube on Windows. The fact that on Windows we go dormant before shutting down (which is something we should fix IMO) is messing things up here. When I close the tab, we transition through dormant mode (thanks to calling mDecoder->NotifyOwnerActivityChanged() inside HTMLMediaElement::NotifyOwnerDocumentActivityChanged(), which runs before SuspendOrResumeElement() where we report the telementry, and going dormant changed the state, messing up the reporting. If the video is paused and I close the tab, |seeking| here is reported as true and |paused| is reported as false. That's the opposite of what's the intended here? Even if the video is playing, |seeking| is true. If I set media.decoder.heuristic.dormant.enabled=false, everything works as I'd expect. So before we can land this, someone will need to prevent us entering dormant mode before shutting down, when we know we're about to shutdown. @@ +2575,5 @@ > + const double errorMargin = 0.05; > + double t = CurrentTime(); > + TimeRanges::index_type index = ranges->Find(t, errorMargin); > + ErrorResult ignore; > + auto s = ranges->Start(index, ignore); I keep getting told off by bz for using auto for non nightmarish-template-hell types. So I feel compelled to pay it forward here and ask you to fully specify the simple type here. @@ +2587,5 @@ > + LOG(PR_LOG_DEBUG, ("%p VIDEO_MSE_STALLED_AT_UNLOAD = %d", this, stalled)); > + Telemetry::Accumulate(Telemetry::VIDEO_MSE_STALLED_AT_UNLOAD, stalled); > + > + LOG(PR_LOG_DEBUG, ("%p VIDEO_MSE_SEEKING_AT_UNLOAD = %d", this, seeking)); > + Telemetry::Accumulate(Telemetry::VIDEO_MSE_SEEKING_AT_UNLOAD, seeking); Style nit: report seeking up near where you declare the `seeking` variable, so that variables are declared close to where they're used.
Attachment #8563892 - Flags: review?(cpearce) → review-
Address review comments. In HTMLMediaElement::NotifyOwnerDocumentActivityChanged I check if we are being destroyed and if so don't call the decoder to do the dormant code. I tested this on win64 to ensure it prevents the seek being true when closing the tab. :bsmedberg for the telemetry code.
Comment on attachment 8564856 [details] [diff] [review] Fix with review comments addressed Review of attachment 8564856 [details] [diff] [review]: ----------------------------------------------------------------- Almost... There's still a problem caused by dormant mode here... If the video element is dormant (i.e. has been non-playing and non-visible for 60+ seconds) and we close the tab, the stats are incorrectly reported; the video is reported as seeking and non-paused at unload. STR: 1. Set media.decoder.heuristic.dormant.timeout=0 to make this faster to reproduce 2. Open video. Pause video once it starts playing. 3. Change tabs so that video is in background 4. Observe "Change machine state from DECODING to DORMANT" in NSPR MediaDecoder logs. 5. Close the video's tab. Observe in NSPR logs: VIDEO_MSE_SEEKING_AT_UNLOAD = 1 VIDEO_MSE_PAUSED_AT_UNLOAD = 0 I would expect the opposite. When a non-visible media pauses or ends, or when a paused/ended media is made non-visible, we start a timer which will make the video dormant. Here's the backtrace when the timer runs: xul.dll!mozilla::MediaDecoderStateMachine::SetDormant(bool aDormant) Line 1501 C++ xul.dll!mozilla::MediaDecoder::UpdateDormantState(bool aDormantTimeout, bool aActivity) Line 189 C++ xul.dll!mozilla::MediaDecoder::DormantTimerExpired(nsITimer * aTimer, void * aClosure) Line 209 C++ xul.dll!nsTimerImpl::Fire() Line 631 C++ xul.dll!nsTimerEvent::Run() Line 729 C++ Also, when media.decoder.heuristic.dormant.timeout=0, we'll still enter dormant on the way to shutdown. That's an edge case in practise, as the user would have to get lucky with the timer firing exactly at the time when the user closes the tab, but we should handle it I think.
Attachment #8564856 - Flags: review?(cpearce) → review-
Comment on attachment 8564856 [details] [diff] [review] Fix with review comments addressed What is the reason for having a bunch of separate histograms instead of one enumerated? Is that because a video can be in multiple states (stalled and seeking?) and we want to be able to record that separately? If that's the case, this is ok, although if you can really only be in one state at unload, I'd prefer a single enumerated histogram.
Attachment #8564856 - Flags: review?(benjamin) → review+
Tracking all MSE P1 bugs for Firefox 37.
* Fix dormancy issue where a dormant video was always reported as seeking. * Fix case where UpdateDormantState was called from timer * Change telemetry to use an enumeration instead of separate values. r+ carried forward from :bsmedberg.
Comment on attachment 8566190 [details] [diff] [review] Fix with review comments addressed Review of attachment 8566190 [details] [diff] [review]: ----------------------------------------------------------------- Tested, and it works here. I was unable to hit a stalled condition in my testing, but the other conditions behave as expected. ::: dom/html/HTMLMediaElement.cpp @@ +2586,5 @@ > + double t = CurrentTime(); > + TimeRanges::index_type index = ranges->Find(t, errorMargin); > + ErrorResult ignore; > + double s = ranges->Start(index, ignore); > + double e = ranges->End(index, ignore); double s and e are unused here now.
Attachment #8566190 - Flags: review?(cpearce) → review+
Updated with fix for breakage: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e43492bc28c2
Carrying forward r+ from cpearce and bsmedberg. Removes unused variables that cpearce noticed and an additional one picked up by try build failure.
This is an MSE P1 that I assume we want for Beta 1. Can you please request uplift before Monday?
Rebase around bug 1091140 for landing on 37. Cajbir, can you double-check this please?
Attachment #8567358 - Flags: feedback?(cajbir.bugzilla)
Comment on attachment 8567358 [details] [diff] [review] Backport to 37 Approval Request Comment [Feature/regressing bug #]: MSE [User impact if declined]: Less effective testing. [Describe test coverage new/current, TreeHerder]: Landed on m-c. [Risks and why]: This adds telemetry reporting of media decoder state at teardown to help diagnose issues. It requires a new method in HTMLMediaElement, but is straightforward. I'm not worried about the risk, and it's better to get it in sooner than later. [String/UUID change made/needed]: None.
Attachment #8567358 - Flags: approval-mozilla-aurora?
Attachment #8567358 - Flags: feedback?(cajbir.bugzilla) → feedback+
Comment on attachment 8567358 [details] [diff] [review] Backport to 37 I agree about the need to land this sooner rather than later as we need this to get some objective measures for MSE. Aurora+
Attachment #8567358 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.