Closed Bug 970742 Opened 9 years ago Closed 9 years ago

Add metrics/telemetry for errors and error recovery

Categories

(Core :: WebRTC, defect, P2)

x86_64
Windows 7
defect

Tracking

()

RESOLVED FIXED
mozilla32

People

(Reporter: mreavy, Assigned: jesup)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [p=3, 1.5:p1, ft:webrtc])

Attachments

(3 files, 2 obsolete files)

We want to measure how effective our congestion control, error correction, and compositor rendering (Does the compositor render every frame it's given) is.
Blocks: 970426
Congestion control measurements:  (mostly only useful in Telemetry and about:webrtc, not Talos unless/until we have real network simulation)
* Time from call start until within (say) 10% of max stable bitrate (not max instantaneous).
* percentage of time in the different modes of congestion control (increase, stable, drain)
* Average and max-stable RTT minus minimum stable RTT (basically try to subtract out "real" RTT to determine how much buffering is occurring in the network on top of that).
Error Correction: what modes it chooses for error recovery (rexmit, full-frame PLI, older-reference-frame).
How long it took to recover in frames/ms (histogram). More/mostly useful in about:webrtc and telemetry than Talos.
Assigning to pkerr, but this is tentative - jib or I might end up doing it
Assignee: nobody → paulrkerr
Priority: -- → P1
Whiteboard: [p=5, 1.5:p1, ft:webrtc]
Target Milestone: --- → mozilla32
Priority: P1 → P2
Whiteboard: [p=5, 1.5:p1, ft:webrtc] → [p=3, 1.5:p1, ft:webrtc]
This should be applicable to B2G as well. Isn't this going into the v2.0 release? If so, I think we should start tracking this.
We are tracking this as part of the WebRTC Platform release for Fx 32 (which aligns with v2.0).  Please see the following if you want more details: https://wiki.mozilla.org/Media/WebRTC#Sprint:_mozilla32_WebRTC_Platform
Blocks: 1011683
I'll file followup bugs for the stuff I split off from this one, which was a bit of a grab-bag of wants.
Assignee: pkerr → rjesup
Summary: Add metric tests for congestion control, error correction, compositor rendering → Add metrics/telemetry for errors and error recovery
Attachment #8436020 - Attachment is obsolete: true
Attachment #8436242 - Flags: review?(jib)
Attachment #8436243 - Flags: review?(jib)
Comment on attachment 8436242 [details] [diff] [review]
Monitor decoder error state to enable recording errors and error recovery times

Review of attachment 8436242 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm. Some comments/questions.

::: media/webrtc/trunk/webrtc/modules/video_coding/main/source/receiver.cc
@@ +60,3 @@
>    } else {
>      state_ = kPassive;
> +    receiveState_ = kReceiveStateWaitingKey; // XXX Initial?

Maybe remove comment or clarify?

I find uncertain-sounding comments ambiguous, for instance, here I can't tell if you're uncertain or if you're soliciting opinions with this comment. You're inventing these states elsewhere, so I suspect the former, which I think means you get to pick. If not, please elaborate.

@@ +391,5 @@
>    }
>    // Update the dual receiver state.
>    if (frame.Complete() && frame.FrameType() == kVideoFrameKey) {
>      UpdateState(kPassive);
> +    return;

If frame.MissingFrame() then this now sets state to kPassive whereas before it set it to kReceiving. Intentional?

::: media/webrtc/trunk/webrtc/modules/video_coding/main/source/receiver.h
@@ +74,5 @@
>    // Dual decoder.
>    bool DualDecoderCaughtUp(VCMEncodedFrame* dual_frame,
>                             VCMReceiver& dual_receiver) const;
>    VCMReceiverState State() const;
> +  VideoReceiveState ReceiveState() const;

Naming is pretty dense here. A receiver with two "receiver-states". Can we come up with better names?

::: media/webrtc/trunk/webrtc/modules/video_coding/main/source/video_receiver.cc
@@ +432,3 @@
>      } else {
>        _dualReceiver.Reset();
> +      SetReceiveState(kReceiveStateWaitingKey); // XXX is this correct?

I don't know.

::: media/webrtc/trunk/webrtc/video/receive_statistics_proxy.cc
@@ +67,5 @@
>  
> +void ReceiveStatisticsProxy::ReceiveStateChange(const int video_channel,
> +                                                VideoReceiveState state) {
> +  CriticalSectionScoped cs(lock_.get());
> +  printf("*** New state for %d: %d (was %d)\n", video_channel, state, receive_state_);

remove printf
Attachment #8436242 - Flags: review?(jib) → review+
Comment on attachment 8436243 [details] [diff] [review]
Log and send to telemetry video errors and error recovery time in webrtc

Review of attachment 8436243 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm. Disclaimer: I wasn't able to deduce a state transition diagram for the new receive state. A comment somewhere on that might be helpful.

::: media/webrtc/signaling/src/media-conduit/VideoConduit.cpp
@@ +112,5 @@
> +      if (mStatsProxy) {
> +        mStatsProxy->EndOfCallStats();
> +      }
> +      mPtrRTP->DeregisterReceiveChannelRtcpStatisticsCallback(mChannel, mStatsProxy);
> +      mPtrRTP->DeregisterReceiveChannelRtpStatisticsCallback(mChannel, mStatsProxy);

Do you want to call these when mStatsProxy == nullptr? If not, perhaps put this inside the if above.

@@ +395,5 @@
>                    mPtrViEBase->LastError());
>        return kMediaConduitRTCPStatusError;
>      }
> +
> +    mStatsProxy = new WebrtcReceiveStats(0, // XXX local_ssrc?

Odd comment. Can we find out more?

@@ +404,5 @@
> +    if(mPtrRTP->RegisterReceiveChannelRtpStatisticsCallback(mChannel, mStatsProxy) != 0)
> +    {
> +      CSFLogError(logTag,  "%s RegisterReceiveChannelRtpStatistics Failed %d ", __FUNCTION__,
> +                  mPtrViEBase->LastError());
> +      return kMediaConduitRtpStatsError;

Looks like these can't fail, so these error definitions are perhaps overkill?

@@ +1133,5 @@
>  }
>  
>  int WebrtcVideoConduit::SendRTCPPacket(int channel, const void* data, int len)
>  {
> +  CSFLogDebug(logTag,  "%s : 0x%02x %d - channel %d , len %d, transmitting: %d", __FUNCTION__,

Why dump the first byte of data? A comment explaining the intent would be helpful.

@@ +1381,5 @@
>  
> +void WebrtcReceiveStats::ReceiveStateChange(const int aChannel,
> +                                            webrtc::VideoReceiveState aState)
> +{
> +  printf("*** New state for %d: %d (was %d)\n", aChannel, aState, mReceiveState);

remove printf

@@ +1386,5 @@
> +#ifdef MOZILLA_INTERNAL_API
> +  if (mFirstDecodeTime.IsNull()) {
> +    mFirstDecodeTime = TimeStamp::Now();
> +  }
> +  if (mReceiveState == webrtc::kReceiveStateNormal || mReceiveState == webrtc::kReceiveStateInitial) {

switch?

@@ +1388,5 @@
> +    mFirstDecodeTime = TimeStamp::Now();
> +  }
> +  if (mReceiveState == webrtc::kReceiveStateNormal || mReceiveState == webrtc::kReceiveStateInitial) {
> +    // in a normal state
> +    if (aState != webrtc::kReceiveStateNormal && aState != webrtc::kReceiveStateInitial) { // paranoia

Isn't receiving a state-change of kReceiveStateInitial at least grounds for an assert?

Since I think you invent these states in the other patch, I'd prefer to see a firmer state transition model, perhaps with assertions instead of silent-running paranoia if()s.

@@ +1396,5 @@
> +      }
> +    }
> +  } else {
> +    // not in a normal state
> +    if (aState == webrtc::kReceiveStateNormal || aState == webrtc::kReceiveStateInitial) { // paranoia

Same question here.

@@ +1402,5 @@
> +      if (mReceiveState == webrtc::kReceiveStatePreemptiveNACK) {
> +        mRecoveredBeforeLoss++;
> +      } else if (!mReceiveFailureTime.IsNull()) { // safety
> +        TimeDuration timeDelta = TimeStamp::Now() - mReceiveFailureTime;
> +        CSFLogError(logTag, "Video error duration: %u ms", (uint32_t) timeDelta.ToMilliseconds());

Maybe use static_cast<> here or %f ?

@@ +1403,5 @@
> +        mRecoveredBeforeLoss++;
> +      } else if (!mReceiveFailureTime.IsNull()) { // safety
> +        TimeDuration timeDelta = TimeStamp::Now() - mReceiveFailureTime;
> +        CSFLogError(logTag, "Video error duration: %u ms", (uint32_t) timeDelta.ToMilliseconds());
> +        Telemetry::Accumulate(Telemetry::WEBRTC_VIDEO_ERROR_RECOVERY_MS, timeDelta.ToMilliseconds());

Odd that this doesn't produce a warning. ToMilliseconds() returns double.

Perhaps consider AccumulateTimeDelta or a cast?

@@ +1424,5 @@
> +    if (callDelta.ToSeconds() != 0) {
> +      CSFLogError(logTag, "Video recovery before error per min %f",
> +                  (float) mRecoveredBeforeLoss/(callDelta.ToSeconds()/60));
> +      Telemetry::Accumulate(Telemetry::WEBRTC_VIDEO_RECOVERY_BEFORE_ERROR_PER_MIN,
> +                            (float) mRecoveredBeforeLoss/(callDelta.ToSeconds()/60));

I think a stack temporary to reduce chance of repetition bugs would be nice here.

@@ +1428,5 @@
> +                            (float) mRecoveredBeforeLoss/(callDelta.ToSeconds()/60));
> +      CSFLogError(logTag, "Video recovery after error per min %f",
> +                  (float) mRecoveredLosses/(callDelta.ToSeconds()/60));
> +      Telemetry::Accumulate(Telemetry::WEBRTC_VIDEO_RECOVERY_AFTER_ERROR_PER_MIN,
> +                            (float) mRecoveredLosses/(callDelta.ToSeconds()/60));

same here

@@ +1432,5 @@
> +                            (float) mRecoveredLosses/(callDelta.ToSeconds()/60));
> +      CSFLogError(logTag, "Video error time percentage %f%%",
> +                  (float) mTotalLossTime.ToSeconds()*100/(callDelta.ToSeconds()));
> +      Telemetry::Accumulate(Telemetry::WEBRTC_VIDEO_ERROR_TIME_PERMILLE,
> +                            (float) mTotalLossTime.ToSeconds()*1000/(callDelta.ToSeconds()));

and here

::: toolkit/components/telemetry/Histograms.json
@@ +5290,5 @@
> +    "high": 1000,
> +    "n_buckets": 200,
> +    "description": "Number of losses recovered after error per min"
> +  },
> +  "WEBRTC_VIDEO_ERROR_TIME_PERMILLE": {

Description says this is time in recovery, so maybe include "RECOVERY" somewhere in the name?
Attachment #8436243 - Flags: review?(jib) → review+
removed some no-longer-needed bits, carry forward r=jib
Attachment #8436242 - Attachment is obsolete: true
Replaces gecko changes for receive state monitoring with CodecStatistics object use
Attachment #8436410 - Flags: review+
Attachment #8436411 - Flags: review?(jib)
Comment on attachment 8436411 [details] [diff] [review]
Add receive state monitoring to webrtc CodecStatistics

Review of attachment 8436411 [details] [diff] [review]:
-----------------------------------------------------------------

r=me. Nits optional.

::: media/webrtc/signaling/src/media-conduit/CodecStatistics.cpp
@@ +92,5 @@
> +   * WaitingKey -> PreemptiveNACK
> +   * DecodingWithErrors -> PreemptiveNACK
> +   */
> +
> +  if (mReceiveState == webrtc::kReceiveStateNormal || mReceiveState == webrtc::kReceiveStateInitial) {

Two states to keep track of in lots of if-statements. Maybe:

switch(mReceiveState) {
  case kReceiveStateInitial:
  case kReceiveStateNormal:
     // in a normal state
     break;
  default:
     // no longer in a normal state
}

to break things up a bit? Also you're using namespace webrtc here, so you don't need webrtc::

@@ +94,5 @@
> +   */
> +
> +  if (mReceiveState == webrtc::kReceiveStateNormal || mReceiveState == webrtc::kReceiveStateInitial) {
> +    // in a normal state
> +    if (aState != webrtc::kReceiveStateNormal && aState != webrtc::kReceiveStateInitial) { // paranoia

Again not a fan of silent paranoia if() statements, but I like the transitions comment. Would prefer asserts to back it up though.

@@ +102,5 @@
> +      }
> +    }
> +  } else {
> +    // not in a normal state
> +    if (aState == webrtc::kReceiveStateNormal || aState == webrtc::kReceiveStateInitial) { // paranoia

webrtc:: and paranoia

@@ +112,5 @@
> +        TimeDuration timeDelta = TimeStamp::Now() - mReceiveFailureTime;
> +        CSFLogError(logTag, "Video error duration: %u ms",
> +                    static_cast<uint32_t>(timeDelta.ToMilliseconds()));
> +        Telemetry::Accumulate(Telemetry::WEBRTC_VIDEO_ERROR_RECOVERY_MS,
> +                              static_cast<uint32_t>(timeDelta.ToMilliseconds()));

Use AccumulateTimeDelta here.

::: toolkit/components/telemetry/Histograms.json
@@ +5330,5 @@
> +    "high": 10000,
> +    "n_buckets": 500,
> +    "description": "Time to recover from a video error in ms"
> +  },
> +  "WEBRTC_VIDEO_RECOVERY_BEFORE_ERROR_PER_MIN": {

This is also in the ReceiveStatisticsProxy patch in this bug, but I understand from IRC that this replaces that patch, so all good.
Attachment #8436411 - Flags: review?(jib) → review+
No longer blocks: webrtc_upstream_bugs
You need to log in before you can comment on or make changes to this bug.