Closed Bug 1217677 Opened 9 years ago Closed 8 years ago

Video Packet drop when sending video at 1920 Kbps

Categories

(Core :: WebRTC, defect, P1)

41 Branch
All
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla47
Tracking Status
firefox47 --- fixed

People

(Reporter: rdaware, Assigned: drno)

References

Details

Attachments

(16 files, 1 obsolete file)

872.85 KB, image/jpeg
Details
295.52 KB, image/jpeg
Details
56 bytes, text/plain
Details
2.37 MB, text/x-log
Details
525.13 KB, text/x-log
Details
90.07 KB, image/png
Details
541.45 KB, text/x-log
Details
2.34 MB, text/x-log
Details
6.96 MB, application/octet-stream
Details
104.91 KB, image/png
Details
3.94 MB, text/x-log
Details
4.45 MB, application/octet-stream
Details
2.58 MB, application/x-zip-compressed
Details
4.77 MB, application/octet-stream
Details
250.16 KB, image/png
Details
58 bytes, text/x-review-board-request
mcmanus
: review+
jesup
: review+
Details
Attached image Wireshark
User Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36

Steps to reproduce:

Receive video at 1920 Kbps


Actual results:

Video has low frame rate and video lag. We can see packet drop in about:webrtc while wireshark trace shows no packet drop. Screenshots are attached. This problem is observed only when PC connected to WiFi


Expected results:

video is smooth
Attached image about:webRTC
Attachment #8677808 - Attachment description: screenshot1.jpg → Wireshark
Attached file Wireshark dump
Severity: normal → critical
OS: Unspecified → Windows
Priority: -- → P1
Hardware: Unspecified → All
Component: Untriaged → WebRTC
Product: Firefox → Core
please collect logs from webrtc (see https://wiki.mozilla.org/Media/WebRTC/Logging)

Looking for the webrtc_trace:65535 logs (set NSPR_LOG_MODULES, and WEBRTC_TRACE_FILE).

Note that with jitter buffers, especially on WiFi, packets arriving (or being processed) "too late" will be reported as dropped (though wireshark will show them).  Also: why are you specifying 1920Kbps in the title of this bug?  Is that a measurement, or some type of setting?  Actual bitrate is quite variable especially on Wifi and controlled with congestion control.
Flags: needinfo?(rdaware)
Hi Randell,

We observe this issue only on high bandwidth Wifi network on Windows. Video is smooth on average bandwidth network and we do not see any packet drop. But this increases when bandwidth is high. A lot of packets get dropped by firefox and we see video lag. By specifying 1920 Kbps means browser sends offer to receive 1920 Kbps. I am getting you the logs soon.
Flags: needinfo?(rdaware)
(In reply to rdaware from comment #4)
> Hi Randell,
> 
> We observe this issue only on high bandwidth Wifi network on Windows. Video
> is smooth on average bandwidth network and we do not see any packet drop.
> But this increases when bandwidth is high. A lot of packets get dropped by
> firefox and we see video lag. By specifying 1920 Kbps means browser sends
> offer to receive 1920 Kbps. I am getting you the logs soon.

An offer to receive 1920 Kbps - ok, but the default is no limit.  We do limit sending rates to ~1.3Mbps for 640x480x30 video (max and min rate is based on camera resolution selected (default is 640x480) and framerate.)

If the other (sending) end is not Firefox (and I'm pretty sure it isn't), then realize it should start low, and ramp up to the max rate, and use congestion control.  It should not simply send at the max rate accepted.  The Wireshark trace is the middle of a call, so I can't see the start, but it looks like fixed-bandwidth to me.
(In reply to Randell Jesup [:jesup] from comment #5)
> (In reply to rdaware from comment #4)
> > Hi Randell,
> > 
> > We observe this issue only on high bandwidth Wifi network on Windows. Video
> > is smooth on average bandwidth network and we do not see any packet drop.
> > But this increases when bandwidth is high. A lot of packets get dropped by
> > firefox and we see video lag. By specifying 1920 Kbps means browser sends
> > offer to receive 1920 Kbps. I am getting you the logs soon.
> 
> An offer to receive 1920 Kbps - ok, but the default is no limit.  We do
> limit sending rates to ~1.3Mbps for 640x480x30 video (max and min rate is
> based on camera resolution selected (default is 640x480) and framerate.)
> 
> If the other (sending) end is not Firefox (and I'm pretty sure it isn't),
> then realize it should start low, and ramp up to the max rate, and use
> congestion control.  It should not simply send at the max rate accepted. 
> The Wireshark trace is the middle of a call, so I can't see the start, but
> it looks like fixed-bandwidth to me.

The other end is not firefox. I am sending you links to wireshark traces on email.
(In reply to Randell Jesup [:jesup] from comment #5)
> (In reply to rdaware from comment #4)
> > Hi Randell,
> > 
> > We observe this issue only on high bandwidth Wifi network on Windows. Video
> > is smooth on average bandwidth network and we do not see any packet drop.
> > But this increases when bandwidth is high. A lot of packets get dropped by
> > firefox and we see video lag. By specifying 1920 Kbps means browser sends
> > offer to receive 1920 Kbps. I am getting you the logs soon.
> 
> An offer to receive 1920 Kbps - ok, but the default is no limit.  We do
> limit sending rates to ~1.3Mbps for 640x480x30 video (max and min rate is
> based on camera resolution selected (default is 640x480) and framerate.)
> 
> If the other (sending) end is not Firefox (and I'm pretty sure it isn't),
> then realize it should start low, and ramp up to the max rate, and use
> congestion control.  It should not simply send at the max rate accepted. 
> The Wireshark trace is the middle of a call, so I can't see the start, but
> it looks like fixed-bandwidth to me.

Hi Randell,
I have sent you logs and traces by email. Please check. might have gone in spam.
Priority: P1 → --
I'll look more deeply at this soon.  Any additional characterization you can do would help.  Can you try lower bitrates, and see if there's a 'cut-off' rate?  Also, verify to quality of the wifi connection (not just strength, but also cross-traffic and jitter).  Does this happen in a different wifi environment - workplace wifi is very different than residential)
Please see Comment 8 from Randell: 
"I'll look more deeply at this soon.  Any additional characterization you can do would help.  Can you try lower bitrates, and see if there's a 'cut-off' rate?  Also, verify to quality of the wifi connection (not just strength, but also cross-traffic and jitter).  Does this happen in a different wifi environment - workplace wifi is very different than residential)?"
Flags: needinfo?(rdaware)
Hi Randell,

I will clear the issue here. This problem is observed in high bandwidth wifi and wired environments. If the video bitrate received is more than 1920 kbps, we see heavy packet drop from firefox. this causes bad video. We do not observe this problem when received video bitrate is less than 1000 Kbps.
Flags: needinfo?(rdaware)
Just following up on our conversation from the call today:  We can't reproduce this, and we're looking for more info to help us identify the root cause of the problem.  If anyone can repro this -- especially across multiple machines and/or multiple networks, please chime in with the info and STRs you have.  Thanks!
(In reply to Maire Reavy [:mreavy] (Plz ni?:mreavy) from comment #11)
> Just following up on our conversation from the call today:  We can't
> reproduce this, and we're looking for more info to help us identify the root
> cause of the problem.  If anyone can repro this -- especially across
> multiple machines and/or multiple networks, please chime in with the info
> and STRs you have.  Thanks!

Unfortunately, we are not sure how we could help on setting up environment for reproducing this issue. We can easily reproduce this issue in our development environment. Would it be possible for any of your engineer to visit us personally? We are located in Milpitas, CA. It would be faster to solve. I would like to manage this meeting if this is possible. Thanks
(In reply to rdaware from comment #12)
> Unfortunately, we are not sure how we could help on setting up environment
> for reproducing this issue. We can easily reproduce this issue in our
> development environment. Would it be possible for any of your engineer to
> visit us personally? We are located in Milpitas, CA. It would be faster to
> solve. I would like to manage this meeting if this is possible. Thanks

@Rdaware: Nils believes he can stop by your office later this week.  Can you reach out to him directly to coordinate an exact time and location to meet? Reaching out by email (nohlmeier@mozilla.com) or on irc (his irc nick in #media is drno) is probably most efficient. Thanks.
Flags: needinfo?(rdaware)
(In reply to Maire Reavy [:mreavy] (Plz ni?:mreavy) from comment #13)
> (In reply to rdaware from comment #12)
> > Unfortunately, we are not sure how we could help on setting up environment
> > for reproducing this issue. We can easily reproduce this issue in our
> > development environment. Would it be possible for any of your engineer to
> > visit us personally? We are located in Milpitas, CA. It would be faster to
> > solve. I would like to manage this meeting if this is possible. Thanks
> 
> @Rdaware: Nils believes he can stop by your office later this week.  Can you
> reach out to him directly to coordinate an exact time and location to meet?
> Reaching out by email (nohlmeier@mozilla.com) or on irc (his irc nick in
> #media is drno) is probably most efficient. Thanks.

Many Thanks to you Maire. I would co-ordinate with Nils and setup this meeting.
Flags: needinfo?(rdaware)
Preliminary result from looking at the logs and traces I took today: 

First a clarification: the WiFi in which this happens is high speed enterprise WiFi with no known network issues. And neither traces nor logs indicate that real pack loss has anything to do with the problem.

During my visit Firefox was not rendering the remote video at all.

From the trace it clear that the other side is definitely not ramping up the connection. We receive an initial RTCP sender report followed right away by 6 RTP packets with the same time stamp (together over 4k bytes of data). And this continues through the trace. No slowing down.

NSPR log shows nothing unusual. But webrtc trace log is full of errors. I think for all RTP time stamps in the wireshark trace I see error like this in the webrtc trace log:
  Failed to decode frame with timestamp 2213416477, error code: -1
Which obviously explains why nothing gets rendered.

I'll share the data I got today with Randell as it's big and I'm not sure if Cisco is okay with making it public in this bug here.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee: nobody → rjesup
Rank: 15
Priority: -- → P1
(In reply to Nils Ohlmeier [:drno] from comment #15)
> Preliminary result from looking at the logs and traces I took today: 
> 
> First a clarification: the WiFi in which this happens is high speed
> enterprise WiFi with no known network issues. And neither traces nor logs
> indicate that real pack loss has anything to do with the problem.
> 
> During my visit Firefox was not rendering the remote video at all.
> 
> From the trace it clear that the other side is definitely not ramping up the
> connection. We receive an initial RTCP sender report followed right away by
> 6 RTP packets with the same time stamp (together over 4k bytes of data). And
> this continues through the trace. No slowing down.
> 
> NSPR log shows nothing unusual. But webrtc trace log is full of errors. I
> think for all RTP time stamps in the wireshark trace I see error like this
> in the webrtc trace log:
>   Failed to decode frame with timestamp 2213416477, error code: -1
> Which obviously explains why nothing gets rendered.
> 
> I'll share the data I got today with Randell as it's big and I'm not sure if
> Cisco is okay with making it public in this bug here.

Hi Nils,

It is fine to make those logs public. Could you please provide an update on this issue?

Regards,
Rahul
I only took a brief look at the logs and traces as described above, because I assumed that Randel would be in a better position to analyze this problem.
Randel, did you have a chance to look at the logs we gathered and send you?
Flags: needinfo?(rjesup)
I've looked.

The packet sequence looks just plain bad as an interactive/realtime video stream - starts at ~1.5-2Mbps with no ramp up.  Worse yet, it doesn't appear to start with a SPS/PPS/I-frame; the first likely SPS/PPS is packet 270.  68 and 76-byte packets (SPS/PPS presumably) occur every 2 seconds; I infer the encoder is set to periodic IDRs @ ~2 seconds

That said, the packets are received, and if there is a valid IDR at packet 270 it should have decoded.  I see numerous "failed to decode" errors, as noted, which seems to be the major issue.  I see no indication that any frames were decoded or output, though I don't see errors reported for every timestamp - but it may be that the decoder is silently eating some frames due to lack of a decoded IDR.

BTW, from the logs, Firefox appears to be 44 or earlier.

I also note there's no audio, so this is a free-running video stream.

Can you capture logs with NSPR_LOG_MODULES=gmp:5,webrtc_trace:65535,signaling:5,timestamp WEBRTC_TRACE_FILE=webrtc.log NSPR_LOG_FILE=nspr.log

That will give us more info about the GMP (OpenH264) decoding traffic.  However, it overall appears as if OpenH264 dislikes the data for some reason.  Capturing an iframe (not a pframe!) in a debugger being fed in and failing to decode would help figure it out, though turning on debug logs in OpenH264 may be enough.

Adding Hank to the bug...

We could make a tweaked build to dump NALs that fail to decode to a file, though it would be a minor annoyance.
Flags: needinfo?(rjesup)
Can you get me the logs mentioned in comment 18?  Thanks
Flags: needinfo?(rdaware)
(In reply to Randell Jesup [:jesup] from comment #18)
> I've looked.
> 
> The packet sequence looks just plain bad as an interactive/realtime video
> stream - starts at ~1.5-2Mbps with no ramp up.  Worse yet, it doesn't appear
> to start with a SPS/PPS/I-frame; the first likely SPS/PPS is packet 270.  68
> and 76-byte packets (SPS/PPS presumably) occur every 2 seconds; I infer the
> encoder is set to periodic IDRs @ ~2 seconds
> 
> That said, the packets are received, and if there is a valid IDR at packet
> 270 it should have decoded.  I see numerous "failed to decode" errors, as
> noted, which seems to be the major issue.  I see no indication that any
> frames were decoded or output, though I don't see errors reported for every
> timestamp - but it may be that the decoder is silently eating some frames
> due to lack of a decoded IDR.
> 
> BTW, from the logs, Firefox appears to be 44 or earlier.
> 
> I also note there's no audio, so this is a free-running video stream.
> 
> Can you capture logs with
> NSPR_LOG_MODULES=gmp:5,webrtc_trace:65535,signaling:5,timestamp
> WEBRTC_TRACE_FILE=webrtc.log NSPR_LOG_FILE=nspr.log
> 
> That will give us more info about the GMP (OpenH264) decoding traffic. 
> However, it overall appears as if OpenH264 dislikes the data for some
> reason.  Capturing an iframe (not a pframe!) in a debugger being fed in and
> failing to decode would help figure it out, though turning on debug logs in
> OpenH264 may be enough.
> 
> Adding Hank to the bug...
> 
> We could make a tweaked build to dump NALs that fail to decode to a file,
> though it would be a minor annoyance.

Hi Randell, I am emailed you new logs.

Thanks,
Rahul
Flags: needinfo?(rdaware)
Hi Rahul,
Please forward the logs to me (hankpeng@cisco.com) as well.
Flags: needinfo?(rdaware)
(In reply to Hank Peng from comment #21)
> Hi Rahul,
> Please forward the logs to me (hankpeng@cisco.com) as well.

Hi Hank,
I have forwarded you the logs.
Flags: needinfo?(rdaware)
nspr.log from Rahul for comment 18
Got some time to look into the log from Rahul.
According to the trace "WebrtcVideoConduit::ReceivedRTPPacket: seq# " in nspr.log, the min received seq# is 10, the max 10908, the total received sequences are 8525, the missed ones are 2374, thus the average packet drop ratio is 21.78%. The figure of attachment 8722218 [details] shows the drop ratio per each 100 packets as well.

In addition, there are 1156 lines of "GMPVideoDecoderParent[217808e0]::Decode() timestamp=\d+ keyframe=0", while 1148 of "GMPVideoDecoderParent[217808e0]::RecvError(error=7)" and none of "GMPVideoDecoderParent[%p]::RecvDecoded()". Considering so many packet gaps, it is hard to tell if something is wrong in the decoder or not. 

I didn't know why the received sequence number of WebrtcVideoConduit is not continuous. I will try to look into it by reproducing the issue in Rahul's enviroment.
(In reply to Hank Peng from comment #26)
> Got some time to look into the log from Rahul.
> According to the trace "WebrtcVideoConduit::ReceivedRTPPacket: seq# " in
> nspr.log, the min received seq# is 10, the max 10908, the total received
> sequences are 8525, the missed ones are 2374, thus the average packet drop
> ratio is 21.78%. The figure of attachment 8722218 [details] shows the drop
> ratio per each 100 packets as well.
> 
> In addition, there are 1156 lines of
> "GMPVideoDecoderParent[217808e0]::Decode() timestamp=\d+ keyframe=0", while
> 1148 of "GMPVideoDecoderParent[217808e0]::RecvError(error=7)" and none of
> "GMPVideoDecoderParent[%p]::RecvDecoded()". Considering so many packet gaps,
> it is hard to tell if something is wrong in the decoder or not. 
> 
> I didn't know why the received sequence number of WebrtcVideoConduit is not
> continuous. I will try to look into it by reproducing the issue in Rahul's
> enviroment.

That is a very interesting observation. As this time Rahul did not provide network captures I went back to the logs from the attempt when I visited the Cisco office, because for that we have nspr log plus wireshark pcap.
Comparing the sequence numbers from the NSPR log with wireshark tells me that the NSPR log file omits packets which clearly exist on the wire. The on the wire non-existing gaps according to the NSPR log are between 1 and 20 packets (note: I only spot checked that the reported gaps in the pcap file).
(In reply to Nils Ohlmeier [:drno] from comment #27)
> (In reply to Hank Peng from comment #26)
> > Got some time to look into the log from Rahul.
> > According to the trace "WebrtcVideoConduit::ReceivedRTPPacket: seq# " in
> > nspr.log, the min received seq# is 10, the max 10908, the total received
> > sequences are 8525, the missed ones are 2374, thus the average packet drop
> > ratio is 21.78%. The figure of attachment 8722218 [details] shows the drop
> > ratio per each 100 packets as well.
> > 
> > In addition, there are 1156 lines of
> > "GMPVideoDecoderParent[217808e0]::Decode() timestamp=\d+ keyframe=0", while
> > 1148 of "GMPVideoDecoderParent[217808e0]::RecvError(error=7)" and none of
> > "GMPVideoDecoderParent[%p]::RecvDecoded()". Considering so many packet gaps,
> > it is hard to tell if something is wrong in the decoder or not. 
> > 
> > I didn't know why the received sequence number of WebrtcVideoConduit is not
> > continuous. I will try to look into it by reproducing the issue in Rahul's
> > enviroment.
> 
> That is a very interesting observation. As this time Rahul did not provide
> network captures I went back to the logs from the attempt when I visited the
> Cisco office, because for that we have nspr log plus wireshark pcap.
> Comparing the sequence numbers from the NSPR log with wireshark tells me
> that the NSPR log file omits packets which clearly exist on the wire. The on
> the wire non-existing gaps according to the NSPR log are between 1 and 20
> packets (note: I only spot checked that the reported gaps in the pcap file).

Hi Nils,

Do you need new logs with network captures? Would that be helpful for you to compare capture with logs?
Assignee: rjesup → drno
Severity: critical → major
backlog: --- → webrtc/webaudio+
I did some testings with the help of Rahul this morning. The related logs and packet capture are attachment 8722664 [details], attachment 8722665 [details], attachment 8722666 [details]. Some findings are as following.

1) The video sender is Cisco Telepresence endpoint EX90. The video resolution is 720p, the network traffic is about 1.9mbps, and the packet rate is 250 packet/sec. 

2) Firefox 44 on Mac OS X 10.10 and Windows 10 didn't have such problem. The received video was of high quality, very smooth and no visible delay. 

3) Firefox 44 on Windows 7 could reproduce this issue 100 percent. We run the testing many times and tried two different machines running Windows 7. This bugs seems an OS dependent issue. However, we didn't try other Windows versions.

4) On Windows 7, sometimes the receiver can decode the first frame. Other times not even a single frame was decoded, thus the render window is empty, which is the case Nils saw last time when he came to Cisco office. Both about:webrtc and nspr log showed lots of packet loss or dropping. 

5) We did another test to make the sender be a WebEx Meeting client, thus the bitrate and packet rate would be lower than EX90 when sending 720p. There was still packet dropping on the transport layer of the receiver side, but the drop ratio was much lower. The video could be seen most of time with poor quality and sometimes it experienced a long time freeze. 

6) Based on our testing, we are pretty sure the problem lies in the transport layer of Firefox. In the latest uploaded logs, the packet capture file showed no single packet loss on network, but the nspr log indicated about 12.67% packets were dropped by the transport layer before dispatching them to WebrtcVideoConduit. So, it seems the throughput of the transport layer on Windows 7 needs to be investigated.
Maybe it is worth mentioning that the performance of machine running Windows 7 shouldn't be a problem. Rahul's laptop is ThinkPad X1 Carbon 3rd, with the 4-core i7-5600U CPU @2.60GHz and equipped with 8 GB memory.
Attached image packet-rate.png
The RTP packet rate: pcap (attachment 8722666 [details]) vs. nspr (attachment 8722665 [details])
Thanks Hank and Rahul for the extensive testing. We think you are right that we first need to find out why nspr show more packet loss then network captures.

Rahul could you please do another test call with the EX90 and your Windows 7 laptop and capture logs with the following environment variable:
 NSPR_LOG_MODULES=mtransport:5,gmp:5,signaling:5,timestamp
 NSPR_LOG_FILE=nspr.log
Flags: needinfo?(rdaware)
Oh and a wireshark trace file together with the log file would be also very helpful (just in case).
Flags: needinfo?(rdaware)
Attached file Traces for comment 36
Thank you Rahul for the super quick turn around time. Unfortunately the logs only tell me that the layers we asked logging really don't see the packets, which means the problem must be in one of the even lower layers.

I'm sorry, but I'm afraid I need at least one more round of logs and wireshark traces:
 NSPR_LOG_MODULES=nsSocketTransport:5,mtransport:5,gmp:5,signaling:5,timestamp
 NSPR_LOG_FILE=nspr.log
Flags: needinfo?(rdaware)
Attached file nspr.zip
Flags: needinfo?(rdaware)
(In reply to rdaware from comment #40)
> Created attachment 8723661 [details]
> nspr.zip

nspr.log for comment 39
Hi Rahul,

quick question: which version of Firefox do or did you use so far?
If you haven't tried it with Firefox 47 yet, which is Nightly right, could I ask you to give it a try with Nightly and check if that improves the situation at all (there is a very slim chance that the fix from bug 1224845 might help here)?

Mostly for documentation purpose:
I looked at the new log file, and basically it shows that even the lowest layers of Firefox, NSPR, in this case does not see/receive the missing packets. Here is an example for one of the gaps in attachment 8723661 [details]:

2016-02-25 18:09:13.317000 UTC - 5792[b117d0]: poll timeout: none
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]:     timeout = -1 milliseconds
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]:     ...returned after 0 milliseconds
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]: Flow[fe01671624b293f4:0,rtp(none)]; Layer[ice]: PacketReceived(1456423751573000 (id=39 url=https://pj1rahul.eng.webex.com/mw3000/mywebex/cmr/cmrlobby.do?siteurl=pj1rahul&username=rdaware&rnd aLevel=0,1,686)
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]: Flow[fe01671624b293f4:0,rtp(none)]; Layer[dtls]: PacketReceived(686)
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]: Flow[fe01671624b293f4:0,rtp(none)]; Layer[dtls]: Receive would have blocked
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]: Acquiring lock on thread b117d0
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]: Acquired lock on thread b117d0
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]: Released lock on thread b117d0
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]: [Socket Thread|WebrtcVideoSessionConduit] VideoConduit.cpp:1250: mozilla::WebrtcVideoConduit::ReceivedRTPPacket: seq# 129, Channel 0, Len 676 
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]: Acquiring lock on thread b117d0
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]: Acquired lock on thread b117d0
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]: Released lock on thread b117d0
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]: STS poll iter [1]
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]:   active [121] { handler=245199fc condition=0 pollflags=1 }
2016-02-25 18:09:13.317000 UTC - 5792[b117d0]:   active [120] { handler=24519b9c condition=0 pollflags=1 }
......
2016-02-25 18:09:13.319000 UTC - 5792[b117d0]:   active [1] { handler=1c6fa540 condition=0 pollflags=5 }
2016-02-25 18:09:13.319000 UTC - 5792[b117d0]:   active [0] { handler=1c6f9830 condition=0 pollflags=5 }
2016-02-25 18:09:13.319000 UTC - 5792[b117d0]:   calling PR_Poll [active=122 idle=0]
2016-02-25 18:09:13.319000 UTC - 5792[b117d0]: poll timeout: none
2016-02-25 18:09:13.319000 UTC - 5792[b117d0]:     timeout = -1 milliseconds
2016-02-25 18:09:13.319000 UTC - 5792[b117d0]:     ...returned after 0 milliseconds
2016-02-25 18:09:13.319000 UTC - 5792[b117d0]: Flow[fe01671624b293f4:0,rtp(none)]; Layer[ice]: PacketReceived(1456423751573000 (id=39 url=https://pj1rahul.eng.webex.com/mw3000/mywebex/cmr/cmrlobby.do?siteurl=pj1rahul&username=rdaware&rnd aLevel=0,1,704)
2016-02-25 18:09:13.319000 UTC - 5792[b117d0]: Flow[fe01671624b293f4:0,rtp(none)]; Layer[dtls]: PacketReceived(704)
2016-02-25 18:09:13.319000 UTC - 5792[b117d0]: Flow[fe01671624b293f4:0,rtp(none)]; Layer[dtls]: Receive would have blocked
2016-02-25 18:09:13.319000 UTC - 5792[b117d0]: Acquiring lock on thread b117d0
2016-02-25 18:09:13.319000 UTC - 5792[b117d0]: Acquired lock on thread b117d0
2016-02-25 18:09:13.319000 UTC - 5792[b117d0]: Released lock on thread b117d0
2016-02-25 18:09:13.319000 UTC - 5792[b117d0]: [Socket Thread|WebrtcVideoSessionConduit] VideoConduit.cpp:1250: mozilla::WebrtcVideoConduit::ReceivedRTPPacket: seq# 145, Channel 0, Len 694 

So the system poll() call returns immediately, and the next packet which delivers to the RTP stack is #145, even though the next expected packet is #130.
Flags: needinfo?(rdaware)
Attached image Packet_burst.png
Here is a theory: as the attached graph shows the packets arrive pretty bursty (between 1packets/100ms up 52packets/100ms). I have to check, but I assume we only use the default receive buffer size on Windows which seems to be 8k. So if we don't read fast enough in case of bursts, the OS simply drops packets. Which would make explain why we fail to get an I frame through to the decoder.
See Also: → 1251821
Attachment #8724522 - Flags: review?(rjesup)
Attachment #8724523 - Flags: review?(rjesup)
https://reviewboard.mozilla.org/r/37061/#review33609

::: media/mtransport/nr_socket_prsock.cpp:605
(Diff revision 1)
> +      opt_rcvbuf.value.recv_buffer_size = 1*1024*1024;

The two options questions from my side are:
1) Would it be worth trying to read the current size first and only increase it if needed (to prevent that this actually decreases on a system with non standard configuration)?
2) What size should we choose for receive buffer? I chose 1MB assuming that such a size should be no big deal on any modern system. But I assume we could probably get away with something smaller.
Comment on attachment 8724523 [details]
MozReview Request: Bug 1217677: increase UDP socket receive buffer for <= Win7. r=jesup

https://reviewboard.mozilla.org/r/37063/#review33635

::: dom/network/UDPSocketParent.cpp:248
(Diff revision 1)
> +  if (recvBufferSize) {

Minor style nit - for numeric values (as opposed to ptrs or bools) I prefer if (foo != 0) to if (foo); I find it clearer.

::: media/mtransport/nr_socket_prsock.cpp:1502
(Diff revision 1)
> +                                    /* recv buffer size */1 * 1024 * 2014))) {

Where does just under 2MB come from?  At minimum we should show why we chose that value.  (is the "just under" to avoid jemalloc rounding it up to the next MB?)

That also may be overkill... and what if someone is sending 4k video to us?  Is it reasonable to dedicate 1MB/socket, including on Android?  (even more so if we need a larger buffer for 4k).

"best" would be to make it adaptive to network arrival rates and network connections.

I'd also like Patrick's thoughts on this.

::: netwerk/base/nsUDPSocket.cpp:1481
(Diff revision 1)
> +  TODO: implement GetSocketOption() and this should work

Is there a bug on it?  If not, open and mark here
Attachment #8724523 - Flags: review?(rjesup)
Comment on attachment 8724522 [details]
MozReview Request: Bug 1217677: increase UDP receive buffer size from default to 1MB

https://reviewboard.mozilla.org/r/37061/#review33637

::: media/mtransport/nr_socket_prsock.cpp:605
(Diff revision 1)
> +      opt_rcvbuf.value.recv_buffer_size = 1*1024*1024;

Why is this different than the other patch?

And I agree that 1MB may not be a "good" number, especially in some conferencing cases.  Do we know which OSes are affected currently in the testcase?

Reading the size may be a good idea.
Attachment #8724522 - Flags: review?(rjesup)
Comment on attachment 8724522 [details]
MozReview Request: Bug 1217677: increase UDP receive buffer size from default to 1MB

f? to mcmanus on the buffer-size issue
Attachment #8724522 - Flags: feedback?(mcmanus)
Attachment #8724523 - Flags: feedback?(mcmanus)
Comment on attachment 8724522 [details]
MozReview Request: Bug 1217677: increase UDP receive buffer size from default to 1MB

https://reviewboard.mozilla.org/r/37061/#review33681

This is all reasonable to me - and as the primary user of that interface I'm happy to have it sync with your requirements.

I will say that I tend to think about buffers in terms of time rather than bytes.. and I generally aim for about 3 second of worst case.. so you've got 8 megabits of buffering at (afaict) 2mbit/s of transfer.. so that's in the ballpark. I'm not usually doing real time things though, so its possible I overbuffer for your use case. But if you have a sense of rate when you do this allocation you might consider sizing it in terms of duration.. fwiw.
Attachment #8724522 - Flags: review+
Comment on attachment 8724523 [details]
MozReview Request: Bug 1217677: increase UDP socket receive buffer for <= Win7. r=jesup

https://reviewboard.mozilla.org/r/37063/#review33683
Attachment #8724523 - Flags: review+
Attachment #8724522 - Flags: review+
Attachment #8724522 - Flags: feedback?(mcmanus)
Attachment #8724522 - Flags: feedback+
Attachment #8724523 - Flags: review+
Attachment #8724523 - Flags: feedback?(mcmanus)
Attachment #8724523 - Flags: feedback+
(In reply to Patrick McManus [:mcmanus] from comment #51)
> Comment on attachment 8724522 [details]
> MozReview Request: Bug 1217677: increase UDP receive buffer size from
> default to 1MB
> 
> https://reviewboard.mozilla.org/r/37061/#review33681
> 
> This is all reasonable to me - and as the primary user of that interface I'm
> happy to have it sync with your requirements.
> 
> I will say that I tend to think about buffers in terms of time rather than
> bytes.. and I generally aim for about 3 second of worst case.. so you've got
> 8 megabits of buffering at (afaict) 2mbit/s of transfer.. so that's in the
> ballpark. I'm not usually doing real time things though, so its possible I
> overbuffer for your use case. But if you have a sense of rate when you do
> this allocation you might consider sizing it in terms of duration.. fwiw.

The rate could vary from <100Kbps (especially when receiving only audio) to 5+Mbps (perhaps well over 5Mbps for 4K video/screensharing over a LAN - some of the Gigabit project people have shown this running).

Also, time is reasonable to think about; 3 seconds of buffering would be unusable for interactive user.  Even 1 second.  500ms is on the border of usability.  The only caveat is that you can get bursts for iframes, but even those must not cause seconds of buffering, and should be paced onto the wire to avoid buffer overflows and drops at the current congestion-control rate.  They may eat (and come in over) a few normal frame times, but again, packets in/time should match the congestion rate more-or-less.

The weakness of that is that on a LAN or other high bandwidth e2e connection, the estimator can't really know how much headroom there is, and should allow higher spikes out in some cases (when we're at the top of the rate window).  I.e. if we use 500-2MBps for say 960x540@30p, if we're at or near 2Mbps estimated, we may (or the congestion controller may) allow a spike above 2 for an iframe.  But that would require reconfiguring the codec in that situation, which is not occurring at present.

The net is we should have less than 1 second of buffering, and in reality probably under 250ms.  Also, anyone throwing data at us at a high rate with no congestion control can cause problems (and not just here), which is what I think is happening at Cisco; they're starting at the top of the window.  

Either we should configure the buffer for the top possible window, or reconfigure it as the inbound/outbound rates change.  Or allocate larger buffers on desktop-ish platforms, and be more aggressive on mobile.
https://reviewboard.mozilla.org/r/37063/#review33635

> Where does just under 2MB come from?  At minimum we should show why we chose that value.  (is the "just under" to avoid jemalloc rounding it up to the next MB?)
> 
> That also may be overkill... and what if someone is sending 4k video to us?  Is it reasonable to dedicate 1MB/socket, including on Android?  (even more so if we need a larger buffer for 4k).
> 
> "best" would be to make it adaptive to network arrival rates and network connections.
> 
> I'd also like Patrick's thoughts on this.

Sorry the 2014 was just butter fingered. Fixed.

> Is there a bug on it?  If not, open and mark here

Adding the code for it right now.
(In reply to Randell Jesup [:jesup] from comment #53)
> Either we should configure the buffer for the top possible window, or
> reconfigure it as the inbound/outbound rates change.  Or allocate larger
> buffers on desktop-ish platforms, and be more aggressive on mobile.

Lets start with: currently we don't know the default buffer sizes for any OS. Searching the net results in hits for TCP sizes, but I don't assume we can assume these apply for UDP sockets as well.
So I'm trying to add GetSocketOption() support for e10s, and then I'll try create a try run with NSPR logging on for the UDP socket to gather the data from our try servers.

Currently it sounds from this bug report here and bug 1251821 and for unknown reasons only Window 7 seems to be affected. It seems like MS added some automagic buffer size adaption for TCP buffers into Vista/7. Maybe they applied that to UDP as well and it bites us here (and nobody reports it as a problem for Vista, because nobody uses Vista) - and MS has fixed/improved that automagic code for >= Win8. But that is all just speculation.

My current patch sets the buffer size for all sockets we create in the process of gathering. That might, depending on the amount of network interfaces and IP address result in a lot of memory usage. My proposal would be to not increase the buffers by default when the sockets get created, but instead let nICEr only request bigger buffers once a pair has been selected. Because currently we don't have any indication that the default buffers effect ICE and/or DTLS negotiation in any way.

Limiting the buffer size request to desktop platforms seems perfectly reasonable to me. If someone gets an un-paced HD or higher resolution stream send to his mobile, he/she will be in for trouble no matter I assume.

Now for the size of the buffer: the network layers (ICE) really don't have any idea about the bandwidth to expect. And any calculation at signaling time gets even more complex if we assume multiple video streams over bundled ICE and then add re-negotiation on top.
In the long term I see two possible alternatives:
A) We let the signaling layers do some calculation and request a min buffer size (assuming the OS grants them), based on the amount of streams, how many are video etc.
B) We let the network layers them self, somewhere in the packet received callbacks, look at the packet rates and packets sizes and then request bigger received buffers from the OS based on the statistics collected from the last N seconds (although we might obviously loose some data with that approach).

But both of these are not easy fixes to the problem at hand here right now. So I think for now we should simply request "some" bigger buffer for sockets of selected pairs on desktops.
(In reply to Nils Ohlmeier [:drno] from comment #55)
> Currently it sounds from this bug report here and bug 1251821 and for
> unknown reasons only Window 7 seems to be affected. It seems like MS added
> some automagic buffer size adaption for TCP buffers into Vista/7. Maybe they
> applied that to UDP as well and it bites us here (and nobody reports it as a
> problem for Vista, because nobody uses Vista) - and MS has fixed/improved
> that automagic code for >= Win8. But that is all just speculation.

So... why not only do this for Win <= 7, or Win7/Vista only?
Flags: needinfo?(drno)
Comment on attachment 8724523 [details]
MozReview Request: Bug 1217677: increase UDP socket receive buffer for <= Win7. r=jesup

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37063/diff/1-2/
Attachment #8724523 - Attachment description: MozReview Request: Bug 1217677: increase UDP receive buffer size for e10s from default to 1MB → MozReview Request: Bug 1217677: increase UDP socket receive buffer for <= Win7. r=jesup
Attachment #8724523 - Flags: review?(rjesup)
Attachment #8724523 - Flags: review?(mcmanus)
Attachment #8724523 - Flags: feedback+
Attachment #8724522 - Attachment is obsolete: true
As discussed today the fix for now is to increase buffer size to three times the size of HD I frame on <Win8 only.
Flags: needinfo?(drno)
Comment on attachment 8724523 [details]
MozReview Request: Bug 1217677: increase UDP socket receive buffer for <= Win7. r=jesup

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37063/diff/2-3/
Comment on attachment 8724523 [details]
MozReview Request: Bug 1217677: increase UDP socket receive buffer for <= Win7. r=jesup

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37063/diff/3-4/
See Also: → 1252769
Comment on attachment 8724523 [details]
MozReview Request: Bug 1217677: increase UDP socket receive buffer for <= Win7. r=jesup

https://reviewboard.mozilla.org/r/37063/#review34139

Note that we probably should have jdm review the udpsocket changes (though it's pretty clear).  If he's not available, jduell or dragana

::: dom/network/UDPSocketParent.cpp:251
(Diff revision 4)
> +      return rv;

I think we should just warn on buffer-size set failure, not fail.  If you think it's really something that shouldn't happen, NS_Assertion or MOZ_ASSERT, but failure to increase buffersize should still let things work.

::: media/mtransport/nr_socket_prsock.cpp:613
(Diff revision 4)
> +        opt_rcvbuf.value.recv_buffer_size = 3 * 21 * 1024;

'losing' (wonders of english spelling)
define 'HD' (1920x1080 I presume)
Explain 3
And with >HD in mind (i.e. screenshares of retina desktops, 4K) let's use 5 for safety.

::: media/mtransport/nr_socket_prsock.cpp:1511
(Diff revision 4)
> +    // receive one HD I frame (~21KB) without loosing packets

clone comment from above, or point people to it
Attachment #8724523 - Flags: review?(rjesup) → review+
Aha, it's up to mcmanus.  That works too :-)
Comment on attachment 8724523 [details]
MozReview Request: Bug 1217677: increase UDP socket receive buffer for <= Win7. r=jesup

https://reviewboard.mozilla.org/r/37063/#review34155
Attachment #8724523 - Flags: review?(mcmanus) → review+
Hi Rahul,

could you please download this Nightly build: http://archive.mozilla.org/pub/firefox/try-builds/drno@ohlmeier.org-cb2fcf460dc2247e70924fbcb00a2533c77b1683/try-win32/firefox-47.0a1.en-US.win32.zip
and check if that fixes the video rendering issue on your Windows 7 laptop?
Comment on attachment 8724523 [details]
MozReview Request: Bug 1217677: increase UDP socket receive buffer for <= Win7. r=jesup

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37063/diff/4-5/
(In reply to Nils Ohlmeier [:drno] from comment #66)
> Hi Rahul,
> 
> could you please download this Nightly build:
> http://archive.mozilla.org/pub/firefox/try-builds/drno@ohlmeier.org-
> cb2fcf460dc2247e70924fbcb00a2533c77b1683/try-win32/firefox-47.0a1.en-US.
> win32.zip
> and check if that fixes the video rendering issue on your Windows 7 laptop?

Hi Nils,

I tried this build. I still have the issue.
(In reply to rdaware from comment #68)
> I tried this build. I still have the issue.

Thanks for the quick testing. And it is surprising that it did not work.

Lets try a bigger hammer (setting recv buffer to 2MB no matter on which OS). Can you check if this build fixes the issue for you:
  http://archive.mozilla.org/pub/firefox/try-builds/drno@ohlmeier.org-db8a393a2a100b17f0500f6416b24accc94effd0/try-win32/firefox-47.0a1.en-US.win32.zip
(In reply to Nils Ohlmeier [:drno] from comment #70)
> (In reply to rdaware from comment #68)
> > I tried this build. I still have the issue.
> 
> Thanks for the quick testing. And it is surprising that it did not work.
> 
> Lets try a bigger hammer (setting recv buffer to 2MB no matter on which OS).
> Can you check if this build fixes the issue for you:
>  
> http://archive.mozilla.org/pub/firefox/try-builds/drno@ohlmeier.org-
> db8a393a2a100b17f0500f6416b24accc94effd0/try-win32/firefox-47.0a1.en-US.
> win32.zip

Hi Nils,

Thanks for new build. This works fine. Here is stats from webrtc logs:
Local: 16:30:36 GMT-0800 (Pacific Standard Time) inboundrtp SSRC: 807403656 Received: 21309 packets (13864.83 Kb) Lost: 15 Jitter: 2.974.
Flags: needinfo?(rdaware)
Hi Rahul,

thanks for testing it so quickly. Now we know that 100KB is too small - although I sill don't really understand why - and 2MB seems to be enough to fix your problem.
To get to a more reasonable value for the receive buffer I created three different builds for you. It would be great if you test them all and let me know which ones work for you and which don't.

1) Build with 1MB receive buffer: http://archive.mozilla.org/pub/firefox/try-builds/drno@ohlmeier.org-125e6185b561eb7cdf24315df54d61e4ff15dd63/try-win32/firefox-47.0a1.en-US.win32.zip
2) Build with 512KB receive buffer: http://archive.mozilla.org/pub/firefox/try-builds/drno@ohlmeier.org-304a768df9550757e6cac5361cfc0ee44b5c41c4/try-win32/firefox-47.0a1.en-US.win32.zip
3) Build with 256KB receive buffer: http://archive.mozilla.org/pub/firefox/try-builds/drno@ohlmeier.org-856e59e31855da9184b9f8049257d29a59f1de7a/try-win32/firefox-47.0a1.en-US.win32.zip
Flags: needinfo?(rdaware)
Hi Nils,

I tested all three builds. All of them work fine though all show very minimum packet loss. The packet loss remains constant and does not increase over time. So video is smooth on all of them.
Many thanks for these builds.
Flags: needinfo?(rdaware)
Comment on attachment 8724523 [details]
MozReview Request: Bug 1217677: increase UDP socket receive buffer for <= Win7. r=jesup

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/37063/diff/5-6/
One last try build with the buffer size of 256K. Check back on try results
Flags: needinfo?(drno)
Flags: needinfo?(drno)
https://hg.mozilla.org/mozilla-central/rev/bca82ee20a78
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: