Closed Bug 1075640 Opened 10 years ago Closed 10 years ago

WebRTC H.264 sporadic crash in WebrtcGmpVideoDecoder

Categories

(Core :: WebRTC: Audio/Video, defect)

33 Branch
x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla36
Tracking Status
firefox33 + fixed
firefox34 + fixed
firefox35 --- fixed
firefox36 --- fixed
firefox-esr31 --- unaffected

People

(Reporter: mzanaty, Assigned: ehugg)

References

Details

Attachments

(3 files, 3 obsolete files)

Seeing the following sporadic crash in WebrtcGmpVideoDecoder when testing real H.264 WebRTC services. Not able to reproduce in simple test apps.

https://crash-stats.mozilla.com/report/index/af34c7f1-657d-4d9b-9b29-2f9772140924
Looking at the crashing line 609:
http://hg.mozilla.org/releases/mozilla-beta/annotate/10a34974f747/media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp#l609

  *(reinterpret_cast<uint32_t*>(frame->Buffer())) = frame->Size();

Should this be frame->Size()-4 (to exclude the size itself)?
The memcpy size-4 is what got me thinking the size poked into the buffer header should also be size-4. What gets poked into the header should probably be the size of what remains excluding the size field itself. That's how most protocol/buffer formats work, including AVCC. But I'm not sure if this internal FF GMP format is expected to be exactly like AVCC or not.

If it is intended to be exactly like AVCC, we may have more problems, since the code is not scanning the entire buffer to replace all 4-byte AnnexB start codes (0x0) with 4-byte AVCC lengths. That is no problem for the OpenH264 decoder, since it expects and handles AnnexB start codes just fine. But I'm not sure if it can be a problem for any OMX HW decoder which may expect strict AVCC with no AnnexB start codes anywhere in the frame.
We may have a problem there, but this particular crash looks like a null deref.  My first guess would be that setting the members to null here:
http://hg.mozilla.org/releases/mozilla-beta/annotate/10a34974f747/media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp#l667
Is racing with the usage here:
http://hg.mozilla.org/releases/mozilla-beta/annotate/10a34974f747/media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp#l635

Perhaps dispatching to the GMP thread to set these to nullptr on release is needed, or I could be way off-base since I don't have a repro case.  Any hints on reproducing this?
Meant (0x00000001) not (0x0) in c#3.

To clarify the full pipeline of buffer mangling that happens from RTP to OpenH264 decoder:
1. RTP has no AnnexB start codes or AVCC lengths.
2. JB adds AnnexB start codes between NALs in the frame buffer.
3. GMP converts the leading AnnexB start code in the frame buffer to an AVCC length covering the entire frame (with potentially embedded AnnexB start codes if there are multiple NALs, which seems wrong).
4. OpenH264 GMP plugin wrapper (part of the OpenH264 project/download) converts AVCC lengths to AnnexB start codes (but due to 3 it only has the leading length to convert since it covers the entire frame).
5. OpenH264 decoder expects and handles AnnexB start codes if there are multiple NALs in the frame.

All of this works fine if line 609 is adjusted to size-4. But I'm not sure if it works fine for OMX HW decoders which may expect strict AVCC with no AnnexB start codes within the frame.

Also not sure if this size-4 issue is related to the crash at all. Don't see how since the buffer pointer would need to be invalid to cause the crash, regardless of what you're poking into that pointer.
(In reply to mzanaty from comment #5)
> Meant (0x00000001) not (0x0) in c#3.
> 
> To clarify the full pipeline of buffer mangling that happens from RTP to
> OpenH264 decoder:
> 1. RTP has no AnnexB start codes or AVCC lengths.
> 2. JB adds AnnexB start codes between NALs in the frame buffer.
> 3. GMP converts the leading AnnexB start code in the frame buffer to an AVCC
> length covering the entire frame (with potentially embedded AnnexB start
> codes if there are multiple NALs, which seems wrong).

This wasn't intended to be an AVCC length; it was intended to be an OMX length.  (Though it's still just as wrong.)  The code for receiving buffers back uses the same "length includes the size field" definition.  Unfortunately, it appears we implemented a different definition than OMX (or AVCC) uses, though the conversion is trivial.  We can look to switch later.

That code assumed it would get called with a single NAL (and even comments that it doesn't handle mode 0 data).  When session_info/etc changed to pass multiple NALs, this should have been adjusted.

The "better" way to support mode 0 data would likely be to use the FragmentationHeader stuff, IIRC how it works, instead of (re)scanning.

> All of this works fine if line 609 is adjusted to size-4. But I'm not sure
> if it works fine for OMX HW decoders which may expect strict AVCC with no
> AnnexB start codes within the frame.

In the plugin, the initial size is replaced with a start code; it doesn't look for multiple AVCC (or OMX-like GMP funky) units within the buffer even you you were to attempt to encode multiple units that way - it only converts the first to a start code.

This all needs some coordinated cleanup.

> Also not sure if this size-4 issue is related to the crash at all. Don't see
> how since the buffer pointer would need to be invalid to cause the crash,
> regardless of what you're poking into that pointer.

Exactly.
Ok, looking closely at the code, I see only one obvious possibility for a null-deref here:

aInputImage._length == 0

If so, CreateEmptyFrame will call DestroyBuffer() instead of allocating a buffer.

We could add a MOZ_ASSERT(aInputImage._length > 0) or if (aInputImage._length == 0) { return  WEBRTC_VIDEO_CODEC_OK; }

However, *why* that might be 0 is an interesting question.
Assignee: nobody → rjesup
Here's a stack from a debug version:

[10001] ###!!! ABORT: NULL segment: 'mSegment', file /Users/ehugg/mozilla/mozilla-beta/ipc/glue/Shmem.cpp, line 320
mozilla::ipc::Shmem::AssertInvariants() const+0x0000004A [/Users/ehugg/mozilla/mozilla-beta/./obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x007BF57A]
unsigned char* mozilla::ipc::Shmem::get<unsigned char>() const+0x00000019 [/Users/ehugg/mozilla/mozilla-beta/./obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x0123A509]
mozilla::gmp::GMPVideoEncodedFrameImpl::Buffer()+0x0000001C [/Users/ehugg/mozilla/mozilla-beta/./obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x020E396C]
mozilla::WebrtcGmpVideoDecoder::Decode_g(webrtc::EncodedImage const&, bool, webrtc::RTPFragmentationHeader const*, webrtc::CodecSpecificInfo const*, long long)+0x0000019D [/Users/ehugg/mozilla/mozilla-beta/./obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x00FC17BD]
mozilla::runnable_args_m_5_ret<mozilla::WebrtcGmpVideoDecoder*, int (mozilla::WebrtcGmpVideoDecoder::*)(webrtc::EncodedImage const&, bool, webrtc::RTPFragmentationHeader const*, webrtc::CodecSpecificInfo const*, long long), webrtc::EncodedImage, bool, web+0x000000BD [/Users/ehugg/mozilla/mozilla-beta/./obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x00FC303D]
mozilla::SyncRunnable::Run()+0x00000036 [/Users/ehugg/mozilla/mozilla-beta/./obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x00F3A1F6]
nsThread::ProcessNextEvent(bool, bool*)+0x000006AA [/Users/ehugg/mozilla/mozilla-beta/./obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x0016639A]
NS_ProcessNextEvent(nsIThread*, bool)+0x00000097 [/Users/ehugg/mozilla/mozilla-beta/./obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x00044957]
mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*)+0x000003D5 [/Users/ehugg/mozilla/mozilla-beta/./obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x007BB9A5]
MessageLoop::RunInternal()+0x00000076 [/Users/ehugg/mozilla/mozilla-beta/./obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x00732586]
MessageLoop::RunHandler()+0x00000015 [/Users/ehugg/mozilla/mozilla-beta/./obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x00732495]
MessageLoop::Run()+0x0000002D [/Users/ehugg/mozilla/mozilla-beta/./obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x0073243D]
nsThread::ThreadFunc(void*)+0x00000166 [/Users/ehugg/mozilla/mozilla-beta/./obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x00164D16]
I can get this to happen now almost every time.  I could either set you up to see it happen or try out other ideas you have.   I'll be back on it this afternoon.
Can you catch it in gdb and look at aInputImage._length in the ::Decode_g() stack frame?
Also, try putting a breakpoint on WebrtcGMPVideoCodec.cpp:602 with a conditional of aInputImage._length == 0 and see where this came from and why.  Thanks!
Yes your guess is correct. aInputImage._length is zero in this case.

It's zero clear back here in the WebRTC video_receiver.
http://dxr.mozilla.org/mozilla-central/source/media/webrtc/trunk/webrtc/modules/video_coding/main/source/video_receiver.cc#465

Also when this happens a lot of these warnings pop out of the OpenH264 decoder:
PPS id is invalid!
SPS id is invalid!

So perhaps we're receiving junk or lost our place in the stream.

I put a check like this at the top of decode_g and it doesn't crash but the video gets pretty choppy:

  if (!aInputImage._length) {
    return WEBRTC_VIDEO_CODEC_ERROR;
  }

Also the team working on this claims that it worked OK a few weeks ago with 33.
Assignee: rjesup → ethanhugg
Status: NEW → ASSIGNED
Using the this length-check patch, call worked for a bit and got these results from about:webrtc using two instance of FF on the same machine.

Side that initiated the call:

PeerConnection:1412722593909716 (id=24 url=https://dev-2.wbx2.com/) 16:02:56 GMT-0700 (PDT)
ICE statistics
1412722593909716 (id=24 url=https://dev-2.wbx2.com/): stream1/audio
Local candidate	Remote candidate	ICE State	Priority	Nominated	Selected
192.168.1.2:52557/udp(host)	166.78.47.126:33488/udp(host)	failed	9149908167411368000
166.137.214.117:62502/udp(peerreflexive)	166.78.47.126:33488/udp(host)	succeeded	7996986662804521000	*	*
Local candidate addr	Type
192.168.1.2:52557/udp	host
166.137.214.117:32547/udp	serverreflexive
166.137.214.117:32542/udp	peerreflexive
Remote candidate addr	Type
166.78.47.126:33488/udp	host
1412722593909716 (id=24 url=https://dev-2.wbx2.com/): stream2/video
Local candidate	Remote candidate	ICE State	Priority	Nominated	Selected
192.168.1.2:52230/udp(host)	166.78.47.126:33568/udp(host)	failed	9149908167411368000
166.137.214.117:62502/udp(peerreflexive)	166.78.47.126:33568/udp(host)	succeeded	7996986662804521000	*	*
Local candidate addr	Type
192.168.1.2:52230/udp	host
166.137.214.117:35438/udp	serverreflexive
166.137.214.117:62502/udp	peerreflexive
Remote candidate addr	Type
166.78.47.126:33568/udp	host
Local SDP

v=0

o=Mozilla-SIPUA-33.0 21185 0 IN IP4 0.0.0.0

s=SIP Call

t=0 0

a=ice-ufrag:b3aa1151

a=ice-pwd:c4604c1ad0301b3ad9ae94bfc483b264

a=fingerprint:sha-256 00:C4:BB:7E:AC:8F:25:03:E9:F4:87:6F:30:28:B9:3A:D5:7B:42:E7:3E:42:62:DA:A0:6E:D2:1A:FD:A9:BE:D0

m=audio 52557 RTP/SAVPF 109 0 8 101

c=IN IP4 192.168.1.2

a=rtpmap:109 opus/48000/2

a=ptime:20

a=rtpmap:0 PCMU/8000

a=rtpmap:8 PCMA/8000

a=rtpmap:101 telephone-event/8000

a=fmtp:101 0-15

a=sendrecv

a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level

a=setup:actpass

a=candidate:0 1 UDP 2130379007 192.168.1.2 52557 typ host

a=candidate:0 2 UDP 2130379006 192.168.1.2 49455 typ host

a=rtcp-mux

m=video 52230 RTP/SAVPF 120 126 97

c=IN IP4 192.168.1.2

a=rtpmap:120 VP8/90000

a=rtpmap:126 H264/90000

a=fmtp:126 profile-level-id=42e01f;packetization-mode=1

a=rtpmap:97 H264/90000

a=fmtp:97 profile-level-id=42e01f

a=sendrecv

a=rtcp-fb:120 nack

a=rtcp-fb:120 nack pli

a=rtcp-fb:120 ccm fir

a=rtcp-fb:126 nack

a=rtcp-fb:126 nack pli

a=rtcp-fb:126 ccm fir

a=rtcp-fb:97 nack

a=rtcp-fb:97 nack pli

a=rtcp-fb:97 ccm fir

a=setup:actpass

a=candidate:0 1 UDP 2130379007 192.168.1.2 52230 typ host

a=candidate:0 2 UDP 2130379006 192.168.1.2 51049 typ host

a=rtcp-mux

Remote SDP

v=0

o=linus 0 0 IN IP4 166.78.47.126

s=-

t=0 0

a=ice-lite

m=audio 33488 RTP/AVP 109

c=IN IP4 166.78.47.126

b=TIAS:64000

a=sendrecv

a=rtpmap:109 opus/48000/2

a=fmtp:109 PROFILE=0;LEVEL=0;packetization-mode=0;level-asymmetry-allowed=1;parameter-add=1;maxaveragebitrate=64000;usedtx=0;stereo=1;useinbandfec=0;cbr=0

a=rtcp-mux

a=label:100

a=ice-ufrag:NT++ViZb

a=ice-pwd:nT2cTHzxKuhL1IeGsLy9BvhvV+dZvvs5

a=candidate:0 1 UDP 2130706431 166.78.47.126 33488 typ host

a=candidate:8 1 TCP 1962934271 166.78.47.126 33488 typ host tcptype passive

a=setup:passive

a=fingerprint:sha-1 49:5F:57:B5:99:ED:31:0A:05:AF:C6:AE:93:AA:34:02:3E:5B:FF:85

m=video 33568 RTP/AVP 126

c=IN IP4 166.78.47.126

b=TIAS:256000

a=sendrecv

a=rtpmap:126 H264/90000

a=fmtp:126 PROFILE=0;LEVEL=0;profile-level-id=42000C;packetization-mode=1;level-asymmetry-allowed=1;max-mbps=27600;max-fs=920;max-br=1000;parameter-add=1;usedtx=0;stereo=0;useinbandfec=0;cbr=0

a=rtcp-fb:* nack pli

a=rtcp-fb:* ccm fir

a=rtcp-mux

a=label:200

a=ice-ufrag:aCzvw9fW

a=ice-pwd:mliaZK7PxfKG9mmkI2dEwz8TQYCiLh+2

a=candidate:0 1 UDP 2130706431 166.78.47.126 33568 typ host

a=candidate:8 1 TCP 1962934271 166.78.47.126 33568 typ host tcptype passive

a=setup:passive

a=fingerprint:sha-1 49:5F:57:B5:99:ED:31:0A:05:AF:C6:AE:93:AA:34:02:3E:5B:FF:85

RTP statistics
outbound_rtp_audio_0
Local: 16:02:56 GMT-0700 (PDT) outboundrtp SSRC: 1830038791 Sent: 18992 packets (1145.35 Kb)
Remote: 16:02:55 GMT-0700 (PDT) inboundrtp SSRC: 1830038791 Received: 17457 packets (664.87 Kb) Lost: 1472 Jitter: 0 RTT: 408 ms
outbound_rtp_video_1
Encoder: Avg. bitrate: 1.32 Mbps (0.47 SD) Avg. framerate: 16.21 fps (1.19 SD) Dropped frames: 4294935271
Local: 16:02:56 GMT-0700 (PDT) outboundrtp SSRC: 1400586114 Sent: 55936 packets (60566.03 Kb)
Remote: 16:02:55 GMT-0700 (PDT) inboundrtp SSRC: 1400586114 Received: 46438 packets (49249.68 Kb) Lost: 9302 Jitter: 0 RTT: 448 ms
inbound_rtp_audio_1
A/V sync: 0 ms Jitter-buffer delay: 523 ms
Local: 16:02:56 GMT-0700 (PDT) inboundrtp SSRC: 4184994738 Received: 16256 packets (884.56 Kb) Lost: 0 Jitter: 0
Remote: 16:02:52 GMT-0700 (PDT) outboundrtp SSRC: 4184994738 Sent: 17644 packets (685.2 Kb)
inbound_rtp_video_2
Decoder: Avg. bitrate: 0.02 Mbps (0.22 SD) Avg. framerate: 0.02 fps (0.26 SD) Discarded packets: 0
Local: 16:02:56 GMT-0700 (PDT) inboundrtp SSRC: 3135148106 Received: 33848 packets (35580.11 Kb) Lost: 8925 Jitter: 9.315
Remote: 16:02:55 GMT-0700 (PDT) outboundrtp SSRC: 3135148106 Sent: 42788 packets (44687.62 Kb)

Side that accepted the call:

PeerConnection:1412722616374064 (id=25 url=https://dev-2.wbx2.com/) 16:03:13 GMT-0700 (PDT)
ICE statistics
1412722616374064 (id=25 url=https://dev-2.wbx2.com/): stream1/audio
Local candidate	Remote candidate	ICE State	Priority	Nominated	Selected
192.168.1.2:60124/udp(host)	166.78.47.126:33548/udp(host)	failed	9149908167411368000
166.137.214.117:60604/udp(peerreflexive)	166.78.47.126:33548/udp(host)	succeeded	7996986662804521000	*	*
Local candidate addr	Type
192.168.1.2:60124/udp	host
166.137.214.117:41544/udp	serverreflexive
166.137.214.117:49661/udp	peerreflexive
Remote candidate addr	Type
166.78.47.126:33548/udp	host
1412722616374064 (id=25 url=https://dev-2.wbx2.com/): stream2/video
Local candidate	Remote candidate	ICE State	Priority	Nominated	Selected
192.168.1.2:61927/udp(host)	166.78.47.126:33440/udp(host)	failed	9149908167411368000
166.137.214.117:60604/udp(peerreflexive)	166.78.47.126:33440/udp(host)	succeeded	7996986662804521000	*	*
Local candidate addr	Type
192.168.1.2:61927/udp	host
166.137.214.117:43694/udp	serverreflexive
166.137.214.117:60604/udp	peerreflexive
Remote candidate addr	Type
166.78.47.126:33440/udp	host
Local SDP

v=0

o=Mozilla-SIPUA-33.0 16753 0 IN IP4 0.0.0.0

s=SIP Call

t=0 0

a=ice-ufrag:04aed478

a=ice-pwd:94b9fbe2fd56549ffce840d486227520

a=fingerprint:sha-256 72:18:C6:92:B3:37:68:A8:EA:3D:10:F1:F6:BA:FA:23:77:15:9D:FE:E2:61:FF:CC:E9:81:08:F0:0F:2D:BE:5D

m=audio 41544 RTP/SAVPF 109 0 8 101

c=IN IP4 166.137.214.117

a=rtpmap:109 opus/48000/2

a=ptime:20

a=rtpmap:0 PCMU/8000

a=rtpmap:8 PCMA/8000

a=rtpmap:101 telephone-event/8000

a=fmtp:101 0-15

a=sendrecv

a=extmap:1 urn:ietf:params:rtp-hdrext:ssrc-audio-level

a=setup:actpass

a=candidate:0 1 UDP 2130379007 192.168.1.2 60124 typ host

a=candidate:1 1 UDP 1694236671 166.137.214.117 41544 typ srflx raddr 192.168.1.2 rport 60124

a=candidate:0 2 UDP 2130379006 192.168.1.2 57558 typ host

a=candidate:1 2 UDP 1694236670 166.137.214.117 48694 typ srflx raddr 192.168.1.2 rport 57558

a=rtcp-mux

m=video 43694 RTP/SAVPF 120 126 97

c=IN IP4 166.137.214.117

a=rtpmap:120 VP8/90000

a=rtpmap:126 H264/90000

a=fmtp:126 profile-level-id=42e01f;packetization-mode=1

a=rtpmap:97 H264/90000

a=fmtp:97 profile-level-id=42e01f

a=sendrecv

a=rtcp-fb:120 nack

a=rtcp-fb:120 nack pli

a=rtcp-fb:120 ccm fir

a=rtcp-fb:126 nack

a=rtcp-fb:126 nack pli

a=rtcp-fb:126 ccm fir

a=rtcp-fb:97 nack

a=rtcp-fb:97 nack pli

a=rtcp-fb:97 ccm fir

a=setup:actpass

a=candidate:0 1 UDP 2130379007 192.168.1.2 61927 typ host

a=candidate:1 1 UDP 1694236671 166.137.214.117 43694 typ srflx raddr 192.168.1.2 rport 61927

a=candidate:0 2 UDP 2130379006 192.168.1.2 64821 typ host

a=candidate:1 2 UDP 1694236670 166.137.214.117 56949 typ srflx raddr 192.168.1.2 rport 64821

a=rtcp-mux

Remote SDP

v=0

o=linus 0 0 IN IP4 166.78.47.126

s=-

t=0 0

a=ice-lite

m=audio 33548 RTP/AVP 109

c=IN IP4 166.78.47.126

b=TIAS:64000

a=sendrecv

a=rtpmap:109 opus/48000/2

a=fmtp:109 PROFILE=0;LEVEL=0;packetization-mode=0;level-asymmetry-allowed=1;parameter-add=1;maxaveragebitrate=64000;usedtx=0;stereo=1;useinbandfec=0;cbr=0

a=rtcp-mux

a=label:100

a=ice-ufrag:YKUmCYoF

a=ice-pwd:NAqEOvtNauhF+Q/OkxY9Jn+4s6ANEWOQ

a=candidate:0 1 UDP 2130706431 166.78.47.126 33548 typ host

a=candidate:8 1 TCP 1962934271 166.78.47.126 33548 typ host tcptype passive

a=setup:passive

a=fingerprint:sha-1 49:5F:57:B5:99:ED:31:0A:05:AF:C6:AE:93:AA:34:02:3E:5B:FF:85

m=video 33440 RTP/AVP 126

c=IN IP4 166.78.47.126

b=TIAS:256000

a=sendrecv

a=rtpmap:126 H264/90000

a=fmtp:126 PROFILE=0;LEVEL=0;profile-level-id=42000C;packetization-mode=1;level-asymmetry-allowed=1;max-mbps=27600;max-fs=920;max-br=1000;parameter-add=1;usedtx=0;stereo=0;useinbandfec=0;cbr=0

a=rtcp-fb:* nack pli

a=rtcp-fb:* ccm fir

a=rtcp-mux

a=label:200

a=ice-ufrag:M1icM74s

a=ice-pwd:r73dtAvkOH6BOpctajBt+oa8JlD9Gnwa

a=candidate:0 1 UDP 2130706431 166.78.47.126 33440 typ host

a=candidate:8 1 TCP 1962934271 166.78.47.126 33440 typ host tcptype passive

a=setup:passive

a=fingerprint:sha-1 49:5F:57:B5:99:ED:31:0A:05:AF:C6:AE:93:AA:34:02:3E:5B:FF:85

RTP statistics
outbound_rtp_audio_0
Local: 16:03:13 GMT-0700 (PDT) outboundrtp SSRC: 4184994738 Sent: 18699 packets (1128.03 Kb)
Remote: 16:03:13 GMT-0700 (PDT) inboundrtp SSRC: 4184994738 Received: 16840 packets (641.37 Kb) Lost: 1528 Jitter: 0 RTT: 720 ms
outbound_rtp_video_1
Encoder: Avg. bitrate: 1.07 Mbps (0.37 SD) Avg. framerate: 15.83 fps (1.21 SD) Dropped frames: 4294946714
Local: 16:03:13 GMT-0700 (PDT) outboundrtp SSRC: 3135148106 Sent: 44965 packets (47955.65 Kb)
Remote: 16:03:13 GMT-0700 (PDT) inboundrtp SSRC: 3135148106 Received: 36214 packets (37805.44 Kb) Lost: 8532 Jitter: 0 RTT: 1524 ms
inbound_rtp_audio_1
A/V sync: 0 ms Jitter-buffer delay: 60 ms
Local: 16:03:13 GMT-0700 (PDT) inboundrtp SSRC: 1830038791 Received: 17076 packets (929.07 Kb) Lost: 0 Jitter: 0
Remote: 16:03:05 GMT-0700 (PDT) outboundrtp SSRC: 1830038791 Sent: 19423 packets (754.24 Kb)
inbound_rtp_video_2
Decoder: Avg. bitrate: 0.00 Mbps (0.00 SD) Avg. framerate: 0.00 fps (0.00 SD) Discarded packets: 0
Local: 16:03:13 GMT-0700 (PDT) inboundrtp SSRC: 1400586114 Received: 47120 packets (50653.75 Kb) Lost: 9686 Jitter: 4.641
Remote: 16:03:11 GMT-0700 (PDT) outboundrtp SSRC: 1400586114 Sent: 58298 packets (61890.17 Kb)
Ok, H.264 mode 1 was selected at both ends.

from the code (by inspection), a=rtcp-fb:* should work.

I'll also note the profile-level-id value seems very low (42000c) - 1.2.

I'll need more logs/traces and maybe adding log statements about the NAL sequence received.

signaling:5,gmp:6 would be a good start
Attached file nsprlogswin.tar.bz2 (obsolete) —
Logs from a successful but choppy call.  Initiating side, Windows with signaling:5,GMP:6,webrtc_trace:65535
Attached file nsprlogsosx.tar.bz2 (obsolete) —
Logs from a successful, but choppy call.  Answering side, OSX with signaling:5,GMP:5,webrtc_trace:65535
Also from the run that created those logs.  Both sides had the message "PPS id is invalid" many times in the console which comes from here:
https://github.com/cisco/openh264/blob/v1.1-Firefox33/codec/decoder/core/src/decoder_core.cpp#L539
Logs with pcap of both sides put up here:
https://drive.google.com/folderview?id=0B3jogUA4XhcTOE9ubDBsb044VjA&usp=sharing

Send me a note with your google account ID if you want to view and I'll add you to the list.
Attachment #8501805 - Attachment is obsolete: true
Attachment #8501806 - Attachment is obsolete: true
nspr log of crash on OSX with signaling:6,GMP:5,webrtc_trace:65535,timestamp and WEBRTC_TRACE_FILE=nspr
This crash appears to not be caused by a particular incoming RTP stream, but rather packet loss.  It showed up in this particular application because all of the video streams are being routed through a server and most of our other tests are peer-to-peer.

On this line we are setting the video engine to continue decoding a frame with errors:
http://dxr.mozilla.org/mozilla-central/source/media/webrtc/trunk/webrtc/video_engine/vie_channel.cc#175
This is why we were getting errors like "PPS id is invalid" from the OpenH264 decoder during high packet loss.

In a packet loss situation we can then end up with a zero-length frame coming out of VCMJitterBuffer::ExtractAndSetDecode.   WebrtcGmpVideoDecoder::Decode_g will then crash on a zero-length frame.

Three possible places we could fix this (there are probably more):
1. Always set decode error mode to kNoErrors for H.264
2. Have WebrtcGmpVideoDecoder::Decode_g return an error on zero-length frames.
3. Have VCMReceiver::FrameForDecoding return NULL instead of any zero-length frames.
(In reply to Ethan Hugg [:ehugg] from comment #22)
> Three possible places we could fix this (there are probably more):
> 1. Always set decode error mode to kNoErrors for H.264
> 2. Have WebrtcGmpVideoDecoder::Decode_g return an error on zero-length
> frames.
> 3. Have VCMReceiver::FrameForDecoding return NULL instead of any zero-length
> frames.

Maybe we should do all 3 for extra safety in the short term.

#3 might need to be specific for H.264. While I can’t imagine why video engine would think it makes sense to invoke decode of an empty frame for any codec, it seems safest to leave other codecs alone since they may have some specific hacks or odd behavior that relies on empty frames.
Attachment #8502210 - Attachment is obsolete: true
I would prefer to simply catch them in Decode_g (#2) which is what the attached patch does.  This has very little chance of side effects so the testing would be simpler and it would be easier to uplift.
(In reply to Ethan Hugg [:ehugg] from comment #24)
That should be sufficient. Any of them alone should be sufficient. More than one is just healthy paranoia. If I had to pick one, #1 seems to catch more things earliest. But again, any of them should be sufficient.
DecodeWithErrors actually has a purpose, especially with mode 0 - it could pass all the "good" NALs through and you'd get a partial update instead of total freeze.  For mode 1, IF you you only include data up to the first missing packet, you also could get a partial decode.  

However - what I think would work better (at least for h.264) here is "decode good frames even if we skipped a frame due to loss".  In talking-head usage, this rarely causes major visual bugs and recovery wipes any small ones away in 0.1-0.3 seconds; often one barely would notice unless the screen is large and bandwidth is high.

kNoErrors (and disabling dual-decode) implies "freeze on error"; I don't like it.  Really there should be an explicit "session->ExtractPacketWithErrors())" that is codec-specific.  As it stands we *cannot* allow decoding with errors because of how Mode 1 NALs are accumulated in the buffer (not to mention it doesn't check HaveFirstPacket()).  If the "dual" decoder used "kDecodeAllCompleteFrames" (i.e. skip frames with errors), that would be better.

kSelectiveErrors would be smarter to use than kWithErrors or kNoErrors, but it has an evil test where it won't decode a frame-with-loss if it has received between between 0.2 and 0.8 of the "average" number of packets/frame.  At least it does check HaveFirstPacket.  I'll note nothing ever sets kSelectiveErrors right now.  So I think that's out for now, OR we modify it to decode complete frames only (but not stall on error).
Ethan - can you give that a try, without the 0-length-buffer check (and with it)?

I think this might yield a better experience (beyond just "not crashing").  To land, I'd take this and #2 or #3 as a safety valve (if this is better than #1, which would be a bit painful to do for H.264 only).

We could also run this one where we leave it kWithErrors, but change the code for kWithErrors from decodable_ = true to decodable_ = (is_h264 ? complete_ : true).
OK, I ran a bunch of tests

1. Without either patch it crashes within the first minute.
2. with the 0-length-frame patch it breaks up, freezes, recovers every few seconds but doesn't crash.  Ran for over 20 mins.
3. With the selective errors patch it doesn't crash after over 20 mins but seems less able to recover.  Part of the screen will stay wrong for a long time and eventually it won't recover at all and just freeze.
4. With both patches the result is the same as #3.

I ran #3 a bunch of times thinking it might be chance, but it really seems to recover differently.
I think this is the 'right' fix, plus comments as to why.  Basically, I found that session_info.cc MakeDecodable() strips partial NALs and packets after a mising one (good!) but if the first packet of a NAL was missing it strips the entire NAL, which can leave 0 bytes to decode.  I would land this and the patch to filter them at the GMP decoder as belt-and-suspenders.  Ethan, can you try this alone and see if it's equivalent to your existing patch?
Attachment #8505253 - Flags: review?(ethanhugg)
Comment on attachment 8501415 [details] [diff] [review]
Check for zero length input before decode

Review of attachment 8501415 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp
@@ +591,5 @@
>      // destroyed via Terminate()
>      return WEBRTC_VIDEO_CODEC_ERROR;
>    }
>  
> +  if (!aInputImage._length) {

As I usually only use ! for booleans and pointers, I'll likely land this as "if (aInputImage._length == 0)"
Attachment #8501415 - Flags: review+
Comment on attachment 8505253 [details] [diff] [review]
Don't return 0-length frames for decoding; comments

Review of attachment 8505253 [details] [diff] [review]:
-----------------------------------------------------------------

Tested this patch alone as well as both patches together and they worked as expected.  No crashes after 20mins and video periodically recovers.
Attachment #8505253 - Flags: review?(ethanhugg) → review+
Try run based on 33 primarily for the wider team to test:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=e0da21982b46
Comment on attachment 8505253 [details] [diff] [review]
Don't return 0-length frames for decoding; comments

Review of attachment 8505253 [details] [diff] [review]:
-----------------------------------------------------------------

Changes look good to fix the crash. However, MakeDecodable has more problems, at least for the OpenH264 decoder. It assumes the decoder can always handle truncated NALs reliably. Some can, most can't. The OpenH264 decoder can't, and therefore should only be fed complete NALs. That would be safest for OMX HW decoders as well. I'll file a separate bug for this.
Attachment #8505253 - Flags: review+
Treeherder seems unable to find this try run so here it is on tbpl:
https://tbpl.mozilla.org/?tree=Try&rev=e0da21982b46
(In reply to mzanaty from comment #34)
> Changes look good to fix the crash. However, MakeDecodable has more
> problems, at least for the OpenH264 decoder. It assumes the decoder can
> always handle truncated NALs reliably. Some can, most can't. The OpenH264
> decoder can't, and therefore should only be fed complete NALs. That would be
> safest for OMX HW decoders as well. I'll file a separate bug for this.

Ethan's test shows it must handle truncated NALs fairly well (his test is mode 1, so tons of NALs are getting truncated).  An earlier test he did for me where I would only mark frames as decodable if they were complete worked, but not as well.  (For mode 1, the only place where that would be different than modifying MakeDecodable to be whole-nals would be on iframes, maybe).
[Tracking Requested - why for this release]:
Important crash fix for a crash that can happen at any time in an h.264 call (if you're unlucky on packet loss).

Will also want to ride-along in any 33 updates
(In reply to Randell Jesup [:jesup] from comment #37)
Verified the OpenH264 v1.1 decoder can indeed reliably handle truncated NALs. So MakeDecodable works fine and no separate bug needed. (At least for OpenH264; can't speak for OMX HW decoders which may be hard to generalize.)
https://hg.mozilla.org/mozilla-central/rev/8a1aa0399dca
https://hg.mozilla.org/mozilla-central/rev/adde0b51f5b2
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla36
Comment on attachment 8505253 [details] [diff] [review]
Don't return 0-length frames for decoding; comments

Would like to see this in 33.0.x if we build one

Approval Request Comment
[Feature/regressing bug #]: H264 support/OpenH264

[User impact if declined]: Random crashes if there's any packet loss not recovered with NACK

[Describe test coverage new/current, TBPL]: We have no tests for real network conditions in our infra yet.  Manually tested by Cisco on a connection with ~50% packet loss and 1s RTT, where it normally crashes within a minute.

[Risks and why]: Extremely low risk.  If we want no risk at all, we can take just the "ignore 0-length frames" patch alone on release, but the risk of the full fix here is virtually 0.

[String/UUID change made/needed]: None
Attachment #8505253 - Flags: approval-mozilla-release?
Attachment #8505253 - Flags: approval-mozilla-beta?
Attachment #8505253 - Flags: approval-mozilla-aurora?
Comment on attachment 8505253 [details] [diff] [review]
Don't return 0-length frames for decoding; comments

Approved for Beta and Aurora. Sylvestre will need to make a call on including this in any point release.
Attachment #8505253 - Flags: approval-mozilla-beta?
Attachment #8505253 - Flags: approval-mozilla-beta+
Attachment #8505253 - Flags: approval-mozilla-aurora?
Attachment #8505253 - Flags: approval-mozilla-aurora+
Comment on attachment 8505253 [details] [diff] [review]
Don't return 0-length frames for decoding; comments

I didn't want to take this in 33.0.1 to make sure it is only about OMTC & top crashes.

However, we are probably going to make a 33.0.2 and I am going to take this one.
Can this be landed in GECKO330_2014101104_RELBRANCH too?
Thanks
Attachment #8505253 - Flags: approval-mozilla-release? → approval-mozilla-release+
Do you need help from the Manual QA team to verify this fix for 33.0.1? If yes, can you also please provide some clear instructions to test?
(In reply to Florin Mezei, QA (:FlorinMezei) from comment #46)
> Do you need help from the Manual QA team to verify this fix for 33.0.1? If
> yes, can you also please provide some clear instructions to test?
From https://hg.mozilla.org/releases/mozilla-release/

It seems 33.0.1 build1 doesn't contain the fix.
(In reply to mzanaty from comment #39)
> (In reply to Randell Jesup [:jesup] from comment #37)
> Verified the OpenH264 v1.1 decoder can indeed reliably handle truncated
> NALs. So MakeDecodable works fine and no separate bug needed. (At least for
> OpenH264; can't speak for OMX HW decoders which may be hard to generalize.)

Is there a way for other people to verify this, or does this require special hardware/devices (that's what it sounded like from the initial report)?
Flags: needinfo?(mzanaty)
This does not require special hardware or devices, simply a case where there is lots of packet loss.  From the duplicate bug above you can see we are not the only ones that ran into this.   You need to have enough packet loss so that you run into the right combination of packets to cause this.   Ping me on IRC if you'd like to duplicate my test scenario.
Flags: needinfo?(mzanaty)
(In reply to Nils Ohlmeier [:drno] from comment #49)
This can be verified by inducing specific packet loss patterns between two Firefox browsers in a WebRTC H.264 call. The first packet(s) of a frame must be lost ("leading loss") to reproduce the crash. With high enough packet loss rates, this may happen within seconds or minutes. To verify the fix, video must continue and recover despite losing the first packet(s) of a frame which can be verified in wireshark.
You need to log in before you can comment on or make changes to this bug.