Closed Bug 1162251 Opened 9 years ago Closed 9 years ago

WebRTC H264 regression in packetization

Categories

(Core :: WebRTC: Networking, defect, P1)

38 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox37 --- unaffected
firefox38 + fixed
firefox38.0.5 + fixed
firefox39 + fixed
firefox40 + fixed
firefox-esr38 38+ fixed
relnote-firefox --- 38+

People

(Reporter: ehugg, Assigned: jesup)

Details

(Keywords: regression)

Attachments

(2 files)

Bug 1050461 fixed a packetization problem with H264 which seems to regressed in FF38+ which has the webrtc40 merge.

This causes Firefox to have difficulty reading the stream from a Spark native client.  The video will show a frame every few seconds even with minimal packet loss.  FF37 does not have this problem and FF38+ does not have the problem when connected via Spark to another Firefox.  Symptoms show with either OpenH264 v1.3 or v1.4.
Assignee: nobody → rjesup
Rank: 5
Priority: -- → P1
Component: WebRTC: Audio/Video → WebRTC: Networking
[Tracking Requested - why for this release]: Tracking for 39+.  We should also track this for 38 since that seems to be where the regression happened.
Attached patch Debugging and some cleanup work — — Splinter Review
Run with webrtc_trace:65535,gmp:5 and WEBRTC_TRACE_FILE=nspr
The changes here don't seem to fix anything important, though they may be useful long-term.
This mostly fixes it for me - it still will freeze up, but it's *much* happier with this patch.  This avoids it considering a standalone 1-packet NAL as a separate 'frame'/session_info, and failing to add it to the rest of the multi-NAL frame/session_info.
Comment on attachment 8602516 [details] [diff] [review]
Fix WebRTC jitter buffer ignoring partial frames if the packet holds a complete NAL

Effectively a 1-liner (all else is indents).  Note: not a sufficient fix I believe as I still get lockup after a short while, and after that just little bits, and the existing code was clearly wrong.
Attachment #8602516 - Flags: review?(mzanaty)
Attachment #8602516 - Flags: review?(ethanhugg)
Attachment #8602516 - Flags: feedback?(pkerr)
Severity: normal → critical
I applied this on top of mozilla-beta and it makes the video work much better in this case.  I'm running a try of opt builds so I can get others here to try it out.:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2b047651e5da
Comment on attachment 8602516 [details] [diff] [review]
Fix WebRTC jitter buffer ignoring partial frames if the packet holds a complete NAL

Review of attachment 8602516 [details] [diff] [review]:
-----------------------------------------------------------------

Tested this patch with builds of M-C and Beta and got smooth video for 10min+ to Spark native clients on Win and OSX.  On M-C I had to also backout the TMMBR patch which landed in 40, but that will be handled in another bug.
Attachment #8602516 - Flags: review?(ethanhugg) → review+
Comment on attachment 8602516 [details] [diff] [review]
Fix WebRTC jitter buffer ignoring partial frames if the packet holds a complete NAL

Review of attachment 8602516 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/trunk/webrtc/modules/video_coding/main/source/jitter_buffer.cc
@@ +647,5 @@
> +  if (*frame)
> +    return kNoError;
> +  *frame = decodable_frames_.FindFrame(packet.seqNum, packet.timestamp);
> +  if (*frame && (*frame)->GetState() != kStateComplete)
> +    return kNoError;

There seems to be a set of built-in assumptions around the upper level code not calling FindFrame for single NAL packets based on the comment in FrameList::FindFrame. But, if there is no timestamp match then the call to FindFrame should just return NULL: the "simple version" should still work.
Attachment #8602516 - Flags: feedback?(pkerr) → feedback+
Another attempt to get builds for others to test.  This time on top of mozilla-release:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ce68af7b81f
I've been running this on top of m-r (aka 38) in a debug build for 2 hours now without any freezes that I noticed.  I am having some freezes on m-inbound, but that's with my additional (largish) debug/cleanup/disable-tmmbr patch and can be dealt with later (if they're real).  If there are other bugs with inbound we'll deal with those separately.

In the meantime I'm going to land this on inbound and request uplift (and more manual testing).  Cisco people - please try this (you can start with Ethan's mozilla-release/FF38 Try build)

Note that trying this as-is on inbound may fail after a short while due (apparently) to the tmmbr code - the send bandwidth from the other end ramped down to 72Kbps and then the other end stopped sending video at all (@ ~1:15).  We'll investigate tmmbr with Cisco and if needed back it out for now or fix it.
Flags: needinfo?(mzanaty)
Flags: needinfo?(ethanhugg)
Flags: needinfo?(dbenham)
Attachment #8602516 - Flags: review?(mzanaty) → feedback?(mzanaty)
Comment on attachment 8602516 [details] [diff] [review]
Fix WebRTC jitter buffer ignoring partial frames if the packet holds a complete NAL

Approval Request Comment
[Feature/regressing bug #]: webrtc.org 40 landing (near start of 38)

[User impact if declined]: Inability to interoperate with the major user of OpenH264 - Cisco's native clients.  Inbound video will freeze and only rarely show a frame.  This is a regression from 37 and would be a major problem for the product.

[Describe test coverage new/current, TreeHerder]: requires manual testing against an external executable and access to Cisco's servers.  Tested for multiple hours against mozilla-release.  We may be able to add some slicing tests in the future with a standalone client.  Also, on 40 inbound, I'm seeing no freezes now either.

[Risks and why]: Extremely low risk - removes an if() which is there purely as an optimization for VP8 (don't bother searching for an active frame if the frame is marked "complete", but with sliced mode1 H264, you can have complete NALs that are only part of a frame.

This is a one-liner, if you ignore indent changes.

[String/UUID change made/needed]: none
Attachment #8602516 - Flags: approval-mozilla-release?
Attachment #8602516 - Flags: approval-mozilla-beta?
Attachment #8602516 - Flags: approval-mozilla-aurora?
(In reply to Randell Jesup [:jesup] from comment #9)
> I've been running this on top of m-r (aka 38) in a debug build for 2 hours
> now without any freezes that I noticed.  I am having some freezes on
> m-inbound, but that's with my additional (largish)
> debug/cleanup/disable-tmmbr patch and can be dealt with later (if they're
> real).  If there are other bugs with inbound we'll deal with those
> separately

I see no freezes with just this patch (and a disable of TMMBR) on 40/inbound debug.
I tested this with local builds on top of M-C, Beta and Release.  Got smooth video of over 10mins connected to the Windows version of the Spark native client.  Seems as good as 37 now at least.
Flags: needinfo?(ethanhugg)
video decoding via Spark windows web client works for me, too.
I have 12 minutes and counting of smooth video with the windows try build with this fix, and the Spark native OS X client on the other side.
Comment on attachment 8602516 [details] [diff] [review]
Fix WebRTC jitter buffer ignoring partial frames if the packet holds a complete NAL

r+. Works for receiving video for 20+ minutes from native spark clients (osx,ios) using the win try build. Safe, simple fix to avoid discarding subsequent single NAL units in the same frame from mode 0 senders. This may not fix other NAL patterns (e.g. mode 1) or packet loss/reorder cases. Further testing and fixes may be needed to cover all cases in separate bug(s).
Flags: needinfo?(mzanaty)
Flags: needinfo?(dbenham)
Attachment #8602516 - Flags: review+
Attachment #8602516 - Flags: feedback?(mzanaty) → feedback+
(In reply to mzanaty from comment #16)
> Comment on attachment 8602516 [details] [diff] [review]
> Fix WebRTC jitter buffer ignoring partial frames if the packet holds a
> complete NAL
> 
> r+. Works for receiving video for 20+ minutes from native spark clients
> (osx,ios) using the win try build. Safe, simple fix to avoid discarding
> subsequent single NAL units in the same frame from mode 0 senders. This may
> not fix other NAL patterns (e.g. mode 1) or packet loss/reorder cases.
> Further testing and fixes may be needed to cover all cases in separate
> bug(s).

Thanks.  This is mode 1; my logs show we're getting sliced H.264 in mode 1, with many slices being 2-5KB long (using mode-1 FuAs).  I've done 3 and 4 hour tests on mozilla-release and mozilla-inbound cleanly.  It's certainly possible there are some edge cases with loss that aren't handled as well as they can be, or where we could do better partial decoding of a sliced frame.
Comment on attachment 8602516 [details] [diff] [review]
Fix WebRTC jitter buffer ignoring partial frames if the packet holds a complete NAL

OK, let's take it in release & the 38.0 relbranch release ( GECKO380_2015050320_RELBRANCH )
Attachment #8602516 - Flags: approval-mozilla-release?
Attachment #8602516 - Flags: approval-mozilla-release+
Attachment #8602516 - Flags: approval-mozilla-beta?
Attachment #8602516 - Flags: approval-mozilla-aurora?
Attachment #8602516 - Flags: approval-mozilla-aurora+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
(In reply to Randell Jesup [:jesup] from comment #17)
> (In reply to mzanaty from comment #16)
> > Comment on attachment 8602516 [details] [diff] [review]
> > Fix WebRTC jitter buffer ignoring partial frames if the packet holds a
> > complete NAL
> > 
> > r+. Works for receiving video for 20+ minutes from native spark clients
> > (osx,ios) using the win try build. Safe, simple fix to avoid discarding
> > subsequent single NAL units in the same frame from mode 0 senders. This may
> > not fix other NAL patterns (e.g. mode 1) or packet loss/reorder cases.
> > Further testing and fixes may be needed to cover all cases in separate
> > bug(s).
> 
> Thanks.  This is mode 1; my logs show we're getting sliced H.264 in mode 1,
> with many slices being 2-5KB long (using mode-1 FuAs).  I've done 3 and 4
> hour tests on mozilla-release and mozilla-inbound cleanly.  It's certainly
> possible there are some edge cases with loss that aren't handled as well as
> they can be, or where we could do better partial decoding of a sliced frame.

For the long-term, we should support all possible packets that an endpoint advertising mode 1 should support (i.e., single-NAL, FU-As, and STAP-As) to be strictly standard compliant.  It would certainly prevent this kind of "regression" in the future.

Open264 made a change to split large frames into multiples large slices (2-4). This is why you are starting to see a mix of FU-As and single-NALs for the same frame.
un-marking ESR38 because I'm not sure where/when it's pulled from
(In reply to Katerina Shiffer from comment #19)
> For the long-term, we should support all possible packets that an endpoint
> advertising mode 1 should support (i.e., single-NAL, FU-As, and STAP-As) to
> be strictly standard compliant.  It would certainly prevent this kind of
> "regression" in the future.

Sure.  We already test FuA's with the fake-h264 plugin; I can extend that to test multiple slices, etc.  It's never a perfect test of course.
 
> Open264 made a change to split large frames into multiples large slices
> (2-4). This is why you are starting to see a mix of FU-As and single-NALs
> for the same frame.

Was this post-1.4 release?  I haven't seen this in 1.4; 1.3 incorporated support for max-length (needed for Mode 0) but doesn't mandate a maximum slice size (and Firefox sets it to something huge for mode 1).  Perhaps it's that the native client started using this new feature in 1.3 and later for mode 1?
Flags: needinfo?(kshiffer)
(In reply to Randell Jesup [:jesup] from comment #23)
> Perhaps it's that the native client started using this new feature in 1.3 and later for mode 1?

Right, the native client using OpenH264 configures it differently in mode 1 (2-4 slices) than Firefox (1 slice).
(In reply to mzanaty from comment #25)
> (In reply to Randell Jesup [:jesup] from comment #23)
> > Perhaps it's that the native client started using this new feature in 1.3 and later for mode 1?
> 
> Right, the native client using OpenH264 configures it differently in mode 1
> (2-4 slices) than Firefox (1 slice).

That's what I figured.  Any reason why?  To shave a smidge off of latency-to-the-wire?  Given on most systems unless resolution (and bandwidth) is high, encoding is <<10ms, it's not going to shave much off.

I'll revise the test to try slicing as well as large FuAs
> > Right, the native client using OpenH264 configures it differently in mode 1
> > (2-4 slices) than Firefox (1 slice).
> 
> That's what I figured.  Any reason why?  To shave a smidge off of
> latency-to-the-wire?  Given on most systems unless resolution (and
> bandwidth) is high, encoding is <<10ms, it's not going to shave much off.


for both latency and (more importantly) resilience to loss.  in unreliable networks (e.g., WiFi, cellular), we were getting frequent video freezes because if one FU-A packet is lost, the whole frame is lost.  with partial frame decode, we can at least apply some error concealment to recover the frame.  ideally, i'd like to get more non-uniform slices (each roughly the MTU size) to get the max resilience benefit.  but i believe 2-4 uniform slices is what was implemented/used in spark.

also, keep in mind that encoding on mobile platforms takes quite a bit more than what's quoted above. 

safest is if we plan to be fully compliant (i.e., support for decoding FU-As, STAP-As, and single-NALs) since we don't know what is on the other side and what their preferred sending mode is.
Flags: needinfo?(kshiffer)
Comment on attachment 8602516 [details] [diff] [review]
Fix WebRTC jitter buffer ignoring partial frames if the packet holds a complete NAL

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration:
Major regression of a feature important to externals; fix was uplifted to 38.0 release and shouldn't regress in ESR38

User impact if declined: See previous approval requests

Fix Landed on Version: 38

Risk to taking this patch (and alternatives if risky): Very low risk

String or UUID changes made by this patch: none
Attachment #8602516 - Flags: approval-mozilla-esr38?
ESR 38.1 (default):
https://hg.mozilla.org/releases/mozilla-esr38/rev/1ffd88fc326c

ESR 38.0.1 (GECKO380esr_2015050513_RELBRANCH):
https://hg.mozilla.org/releases/mozilla-esr38/rev/95ee75b86403

(With in-person approval from lmandel)
Target Milestone: --- → mozilla40
Comment on attachment 8602516 [details] [diff] [review]
Fix WebRTC jitter buffer ignoring partial frames if the packet holds a complete NAL

As Ryan noted, approved for ESR 38.0.1.
Attachment #8602516 - Flags: approval-mozilla-esr38? → approval-mozilla-esr38+
Release Note Request (optional, but appreciated)
[Why is this notable]: Major regression of a feature important to externals; fix was uplifted to 38.0 release and shouldn't regress in ESR38
[Suggested wording]: Fixed: WebRTC H264 streams from Spark native clients may not be read correctly
[Links (documentation, blog post, etc)]:

jesup - Can you confirm that the note is correct or suggest an alternative?
Flags: needinfo?(rjesup)
When creating the release notes for ESR 38.0.1 I realized that we have a limitation that ESR and Firefox are considered the same when it comes to release notes. This means that I can't create separate notes for Firefox 38.0.1 and ESR 38.0.1. Given that this issue is fixed in Firefox 38.0, it seems somewhat confusing to note the fix in the Firefox 38.0.1 notes (even with a qualifier that this is ESR specific). Is this low level enough that we can reasonably skip the note? Given that Hello is disabled in ESR and I don't think this functionality existed in Firefox 31 (previous ESR release), perhaps we can just leave this out of the relnotes.

jesup - wdyt?
OpenH264 is never used with Hello in the first place; it's used for services that interoperate with other non-browser systems (and perhaps with some browsers in the future), like ciscospark.com.  So Hello being disabled in ESR isn't relevant.

I think relnote it and mention the difference

[Suggested wording]: Fixed: WebRTC H264 video streams from CiscoSpark native clients are not decoded correctly.  (Note: Fix not landed in ESR38.0.1; planned for ESR38.0.2)
Flags: needinfo?(rjesup) → needinfo?(lmandel)
(In reply to Randell Jesup [:jesup] from comment #33)

> [Suggested wording]: Fixed: WebRTC H264 video streams from CiscoSpark native
> clients are not decoded correctly.  (Note: Fix not landed in ESR38.0.1;
> planned for ESR38.0.2)

Ooops: retry
[Suggested wording]: Fixed: WebRTC H264 video streams from CiscoSpark native clients are not decoded correctly.  (Note: Fix landed in ESR38.0.1; was already landed in Firefox 38.0)
Thanks for the clarification jesup. I'll go with your note.
Flags: needinfo?(lmandel)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: