Closed Bug 1240209 Opened 4 years ago Closed 3 years ago

Increased latency with WebRTC data channels

Categories

(Core :: WebRTC: Networking, defect, P1)

38 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox49 --- fixed
Blocking Flags:

People

(Reporter: sheeners, Assigned: jesup, NeedInfo)

References

Details

Attachments

(4 files)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0
Build ID: 20151221154602

Steps to reproduce:

Current Behavior:  
Whenever messages fail to send with EAGAIN/EWOULDBLOCK, Firefox queues up a SendDeferredMessages callback using mDeferTimeout and a one-shot nsITimer. This callback tries to drain the buffer completely. The initial value of mDeferTimeout is 10 ms.
- If the buffer is now empty (i.e., Firefox sent everything to SCTP), the function ends. We’re done here. Note that mDeferTimeout is not modified. It “becomes an estimate of how long we need to wait next time we block,” to quote the source code.
- If at least one packet was sent but the buffer still has data, the timer is requeued. The new mDeferTimeout is max(mDeferTimeout – 1, 10 ms).
- If no packets were sent, the timer is requeued. The new mDeferTimeout is min(mDeferTimeout + 1, 50 ms).




Expected results:

Suggestion to address the latency issue:
Leverage Google’s fix to provide a SCTP send buffer callback to usrsctp_socket() and tie their flow control to that notification
Component: Untriaged → WebRTC: Networking
Product: Firefox → Core
The issue here is more one of large-transfer bandwidth, as this only comes into play when we've filled the sctp user buffer.
Status: UNCONFIRMED → NEW
backlog: --- → webrtc/webaudio+
Rank: 25
Ever confirmed: true
Priority: -- → P2
Thanks, Randell, We have customer use cases that leverage WebRTC datachannel, and are currently blocked on Firefox.  We have been able to get this resolved for Chrome, and need a similar resolution for Firefox as well. Let me know if you need further details - Chris V, cc'd on this bug, is our technical lead on this particular issue, and he can provide more color, if needed.
The default SCTP user buffer is 256 KB, so this is relatively easy to trigger with lots of small transfers. I've had to work around similar issues in the webrtc.org code base prior to Google implementing https://bugs.chromium.org/p/webrtc/issues/detail?id=4468, so I've seen these symptoms in the past. Restarting SCTP sends as soon as the buffer is no longer full caused a significant improvement to latency.
Randell,
This DataChannel latency issue is blocking our product launch. Could you please let me know what needs to be done to raise it to P1 ?

Thx,
NK
Randell has been looking into this and has some detailed follow up questions.  Expect a post from him on this bug very soon. I'm the engineering manager for WebRTC;  if you have questions about the priorities, the roadmap, or who is working on what, feel free to ping, email or needinfo me.  I'm happy to help.
Assignee: nobody → rjesup
Rank: 25 → 15
Priority: P2 → P1
I think there must be some piece of the problem that hasn't been explained.  The description  in comment 0 describes what the code does, but not why this has any negative impact on your usecases or what the problem is.  "Increased latency" is too vague, I'm afraid.

I looked at Google's patch, and did an equivalent to it in our code, and saw little or no change in our transfer rates across a wide range of RTT values.

From their bug:

    | Latency (ms) | Mod Result (K/sec) | Chr 45 Result (K/sec) |
    |--------------+--------------------+-----------------------|
    |            0 |                984 |                   910 |
    |          100 |                772 |                   300 |
    |          200 |                447 |                   153 |
    |          500 |                187 |                    71 |

My results (not the same test, using reep.io on a LAN or local Wifi):

    | Latency (ms) | Mod Result (K/sec) |       Nightly (K/sec) |
    |--------------+--------------------+-----------------------|
    |            0 |              ~4000 |                 ~4000 |
    |          100 |               ~550 |                  ~500 |
    |          200 |               ~340 |                  ~340 |
    |          500 |               ~180 |                  ~170 |

This shows that the adaptive timer for buffer-refill is working pretty well across a wide range of RTT (and if this is to be believed, much better at low-RTT - but I don't have their test).  I note with *both* versions some dips and plateaus at higher RTT; for example at 200ms I see plateaus of ~450K (similar to chrome), but with occasional dips which pull the long-term value down. I plan to take some SCTP wireshark traces to discuss this with Michael Tuexen, but I don't see anything problematical here -- so please help me by filling in the blanks.

Thanks!
Flags: needinfo?(vanderkn)
Flags: needinfo?(sheeners)
Flags: needinfo?(nick.srinivas)
Randell

I'll download your build which uses Chrome's mechanism and check if it helps with our use-cases.

Liubo
To make sure I'm testing the right build:

I did not see any green B's at https://treeherder.mozilla.org/#/jobs?repo=try&revision=bd4b1ee34e18 at 4:53 PST, 05/26/2016 except a green "AB" for "Linux x64 opt", however, clicking its "Build" leads to "Artifact not found". So instead I chose a green "mda" for "Windows 7 opt" and downloaded firefox-49.0a1.en-US.win32.zip, which seems to be running.

I'll use this for the test.
Here are the test results: 

Test: push data through data channel at rate up to 1MBps using three different clients: Chrome(50.0.2661.102), Firefox without Fix(46.0.1), and Firefox with Fix(bd4b1ee34e18). Did local(on one machine) and remote(on two machines) tests for both reliable and unreliable data channels. So totally there were 4 tests, each test was about 2 mins. 

Results: For all test cases, for Firefox, the one with Fix easily beat the one without Fix.

Conclusion: The fix did help with data channel throughput.

Suggestion: Check-in the fix. Also, it's noticed that Chrome listens to SCTP_SENDER_DRY_EVENT and emit SignalReadyToSend as the above fix, probably it's better for Firefox to bring that in too.

Unreliable Data Channel Test Results( Reliable Data Channel has same results )
=================================================================================
                        Remote                Local
Client                Throughput            Throughput
Chrome                  470KBps          335KBps -> 286KBps
FF w/o Fix          300KBps -> 160KBps   320KBps -> 176KBps
FF w Fix            475KBps -> 416KBps   460KBps -> 350KBps
(In reply to Liubo Chen from comment #9)
> Here are the test results: 
> 
> Test: push data through data channel at rate up to 1MBps using three
> different clients: Chrome(50.0.2661.102), Firefox without Fix(46.0.1), and
> Firefox with Fix(bd4b1ee34e18). Did local(on one machine) and remote(on two
> machines) tests for both reliable and unreliable data channels. So totally
> there were 4 tests, each test was about 2 mins. 

Can you provide a snippet of the code for this test? Or at minimum describe what's sending the data and what's done receiving it (and how the rate is measured). The numbers you're showing don't make sense to me (for Chrome or Firefox).

Also, what does X KBps-> Y KBps mean in your table?  Started at X, ended at Y, or varied from X to Y, or max X, min Y?  Also, I assume that's KBytes/s, not Kbits/s

For a 1Mbps source fed through a DataChannel on a LAN, let alone the same machine, should produce 1Mbps transfer or close to.  I was getting sustained 4MBps on a LAN between two machines, and in a tab-to-tab within the same browser I got ~12MB/s - both with or without the fix.  At higher RTTs I do see some plateaus and dips, but it's unclear if they're better or worse with the fix.

I don't disbelieve the test showed this, but I want to ensure I understand what was being tested.

If you want to go one step further: set NSPR_LOG_MODULES=SCTP:5,DataChannel:4 NSPR_LOG_FILE=some temp file before running Firefox, and take the log from from each (with and without the fix) and provide it to me (dropbox, etc).  Then I can see exactly what's happening at the SCTP level.

> Suggestion: Check-in the fix. Also, it's noticed that Chrome listens to
> SCTP_SENDER_DRY_EVENT and emit SignalReadyToSend as the above fix, probably
> it's better for Firefox to bring that in too.

I'll look at that.
MozReview-Commit-ID: 3NrTaCbCuvJ
Attachment #8757565 - Flags: review?(drno)
Can't add tuexen for reviews since he hasn't logged into bugzilla recently
MozReview-Commit-ID: BPjqdRCJxio
Attachment #8757566 - Flags: review?(drno)
It really makes sense to use the callbacks instead of running an additional timer on top of the SCTP stack.
So I would vote for such a mechanism.
Comment on attachment 8757565 [details] [diff] [review]
use buffer-half-empty callbacks to send buffered data instead of a timer

Now that he signed in, I can ask for a review! :-)

I agree, it makes sense to use them - but I want to be sure this is actually going to fix their problem (and maybe do more if I can).
Attachment #8757565 - Flags: review?(tuexen)
Comment on attachment 8757565 [details] [diff] [review]
use buffer-half-empty callbacks to send buffered data instead of a timer

Review of attachment 8757565 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM, but Michael's opinion should probably count more :-)

::: netwerk/sctp/datachannel/DataChannel.cpp
@@ +484,5 @@
>  
>  void
>  DataChannelConnection::StartDefer()
>  {
> +#if 0

I think for an official landing of this (vs. just providing a try build for testing) this dead code should be deleted (and resurrected from versioning system if needed).

@@ +2674,4 @@
>    for (auto& buffer : mBufferedData) {
>      buffered += buffer->mLength;
>    }
>    // XXX Note: per Michael Tuexen, there's no way to currently get the buffered

Does this comment still apply?
Attachment #8757565 - Flags: review?(drno) → review+
Comment on attachment 8757566 [details] [diff] [review]
Remove adaptive timer for sending buffered DataChannel messages

Review of attachment 8757566 [details] [diff] [review]:
-----------------------------------------------------------------

I'm wondering if all scenarios where we used to re-try via StartDefer() are really covered by the threshold now?
Might it be a better (even thought more complex) solution to have the timer still around but with bigger timeout values to retry in cases where the threshold did not resulted in the expected resend?

::: netwerk/sctp/datachannel/DataChannel.h
@@ -561,5 @@
>            default:
>              break;
>          }
>          break;
> -      case START_DEFER:

I think you can remove |START_DEFER| from the enum here https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/datachannel/DataChannel.h#457 as well
Comment on attachment 8757565 [details] [diff] [review]
use buffer-half-empty callbacks to send buffered data instead of a timer

Review of attachment 8757565 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM, but Michael's opinion should probably count more :-)

::: netwerk/sctp/datachannel/DataChannel.cpp
@@ +2674,4 @@
>    for (auto& buffer : mBufferedData) {
>      buffered += buffer->mLength;
>    }
>    // XXX Note: per Michael Tuexen, there's no way to currently get the buffered

Does this comment still apply?
Comment on attachment 8757565 [details] [diff] [review]
use buffer-half-empty callbacks to send buffered data instead of a timer

Review of attachment 8757565 [details] [diff] [review]:
-----------------------------------------------------------------

Look good to me.

::: netwerk/sctp/datachannel/DataChannel.cpp
@@ +2674,5 @@
>    for (auto& buffer : mBufferedData) {
>      buffered += buffer->mLength;
>    }
>    // XXX Note: per Michael Tuexen, there's no way to currently get the buffered
>    // amount from the SCTP stack for a single stream.  It is on their to-do

Yes it does. There is currently no socket option providing this information.
Attachment #8757565 - Flags: review?(tuexen) → review+
I went over the locking with a fine-tooth-comb, and found one hole/race (extremely unlikely race, but possible)
Attachment #8758102 - Flags: review?(drno)
Below is the snippet of the code: basically I was creating an interval which keeps pushing about 2KB data with order # attached into the datachannel every 2ms and then measure the throughput. I was simulating our use-case( I believe if we make the packet larger we'll get more throughput ).

X KBps-> Y KBps means starting at around X KBps and stables at around Y KBps after around 2 mins. So you can tell that Chrome is pretty stable.

Data were verified received at the receiver side.


============================================================
 
        var testDataChannelInterval;
        testDataChannel.onOpen = function () {
            logger.info(TAG, "(local) testDataChannel.onOpen()");
            if (testDataChannel) {
                var amountSent = 0;
                var startTime = new Date().getTime();
                var rate = 0;
                var msg = "";
                for(var i = 0; i < 2000; ++i) {
                    msg += "A";
                }
                var index = 0;
                
                testDataChannelInterval = setInterval(
                    function () {
                        var currentTime = new Date().getTime();
                        var tmp = index + ": " + msg;
                        index++;
                        
                        testDataChannel.send(tmp);
                        amountSent += tmp.length;
                        rate = amountSent / (currentTime - startTime);
                        logger.info(TAG, "Current throughput: " + rate);
                    },
                    2);
            }
        };

        testDataChannel.onClose = function () {
            logger.info(TAG, "(local) testDataChannel.onClose()");
            if (testDataChannel) {
                testDataChannel = null;
            }
            if(testDataChannelInterval) {
                clearInterval(testDataChannelInterval);
                testDataChannelInterval = null;
            }            
        };
        testDataChannel.onError = function () {
            logger.error(TAG, "(local) testDataChannel.onError()");
            if (testDataChannel) {
                testDataChannel = null;
            }
            
            if(testDataChannelInterval) {
                clearInterval(testDataChannelInterval);
                testDataChannelInterval = null;
            }
        };
I'm slightly worried over calculating the rate every 2ms... depending on what 'logger' does in particular, but it also is going to throw a LOT of garbage for GC (all the new's, etc).  But it may not make a ton of difference on a fast machine with lots of memory.

I modified https://mozilla.github.io/webrtc-landing/data_test.html to include a modified version of this, and I see rates start around 460-480, and climb slightly to mid-480's or 490 (on my fast XEON which was also doing an all-cores-busy compile!).  With no compiles, it starts around 490 and climbs to 500.  This is with nightly, opt, linux64, local build, without this fix.
With the fix, it also quickly converges on 500...  perhaps it hits it a couple of seconds earlier.  This is within a tab, so basically going through a loopback device (0 RTT).  

These don't match your testing....

One question - are you checking that interval timeouts aren't getting dropped?  If an interval takes too long, the next one will be queued.  However, only *one* will be queued, so if one takes 10ms *for any reason* (including UI hangs), two will fire for that 10ms chunk instead 5.  If they all take 10ms to fire, or close, you'll run at 1/5th normal speed or so.

Note that unlike FF, without e10s (multiprocess, default in aurora (48) and nightly 49 currently), the UI runs in the same process and same main/JS event queue.  This can make high-frequency timers somewhat erratic in firing times.
Attached file data_test.html
Comment on attachment 8758102 [details] [diff] [review]
clean up one issue with locking access to mBufferedData

Review of attachment 8758102 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8758102 - Flags: review?(drno) → review+
So I also ran the attached data_test.html with and without the patch on my MacBook Pro and don't see any difference. And as Randell I also get close 500.

I noticed that the numbers which get reported from JS look different to what a network monitor, LittleSnitch in this case, tells me about the data throughput. Basically LittleSnitch's network monitor reports super stable 550kB/s when executing the throughout test on the data_test.html page. But as soon as I switch to another tab while the throughput test is running the network monitor reports a throughput of only 2kB/s, until I switch back to the tab which runs the throughput test. I'm guessing that is some JS optimization which slows down setInterval() in case a tab does not have the focus.
Now the throughput report on data_test.html obviously reports really low throughput numbers after the tab switching. And because it calculates the average over time the number recovers only veeerrry slowly. But at the same time the network monitor reports that immediately after bringing the throughput test tab into focus the data throughput is back to 550kB/s.

So just to be safe: Firefox and the tab which executes your JS test code stays in the foreground the whole time during your test run, correct?
Comment on attachment 8757566 [details] [diff] [review]
Remove adaptive timer for sending buffered DataChannel messages

Review of attachment 8757566 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8757566 - Flags: review?(drno) → review+
Right. Since I need to look at the console output, it's always focused.
Pushed by rjesup@wgate.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/fa2962ff448e
use buffer-half-empty callbacks to send buffered data instead of a timer r=tuexen,drno
(The checkin was of all three patches, merged.)

Timer callbacks for defocused tabs are severely slowed down to around 1/second, and may eventually stop (chrome does similar, and maybe even more (full stop after a bit)).

I should note that opening/selecting the browser console will cause a dip in the results (not major), but doesn't count as defocusing the tab.

One thing I *will* tell you is that the measurement should be done at the receiving side, not the sending.  Otherwise locally-buffered data "counts" - and note that channel.send() (at least in FF) doesn't block even if the SCTP buffers are full; we just queue the message for eventual transmission. You can see this trivially by setting the size for each message to 20000 instead of 2000; the reported "rate" becomes ~5000 - because it's measuring the number of buffered messages, not data transfer rate.


For any sort of realtime transfer, you don't want to have data be buffered either in SCTP or outside of it.

For a real use of mouse and keyboard data, I would only insert events when bufferedAmount() is 0, and otherwise accumulate them.  And that (currently) just tells you that there is no data buffered *outside* of SCTP, since the SCTP stack doesn't expose a per-channel socket option to ask for the amount buffered.  I suspect since we both use the same SCTP stack that Chrome has a similar limitation on bufferedAmount().

A much-more-complex solution would be to add an application-level ack in order to try to keep the buffered data down to near zero.  But this verges on delay-sensitive-congestion-control in JS, which is not a good idea. (not even close).

If your data is small/bursty, then buffering may rarely come into play at all - and thus this patch would have no effect in practice, and you should just send data as soon as you have it.  This patch almost solely applies to bulk data transfers, or transfers in parallel with audio/video which are enough to fill the bottleneck node, or if there's enough competing cross-traffic to drop throughput rates or make them very bursty.
https://hg.mozilla.org/mozilla-central/rev/fa2962ff448e
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Thanks for pushing the changes!

Ahh, we do have the mechanism to insert event only if bufferedAmount > 0. As mentioned above, data were verified to be received by the receiver in order, so as a quick test, I just did the calculation at the sender side. And, we do send data in parallel with A/V in some use-cases.

Thanks for your help with this.
Thanks for the fix. Will this be backported to any ESR releases?
Flags: needinfo?(vanderkn)
(In reply to vanderkn from comment #33)
> Thanks for the fix. Will this be backported to any ESR releases?

No - generally only security or major stability fixes are backported to ESR.  This will "ride the trains" to release in Firefox 49.  It will be in the next ESR, which I think is 52.

As I noted, the test you're doing really isn't valid, and I'm still unclear as to why you believe this change will help your application (or what was wrong before).  This patch really only affects cases where you've sent enough data that hasn't been acked that the SCTP buffers are full - something that I'd want very much to avoid for mouse/keyboard events!

I landed the patch because it's a better solution than the timer and gets rid of a bunch of code.  Again, I can believe there are circumstances where the callback works better, but per my transfer measurements I think it worked well already.
Hi Randell, 
Thanks a ton for getting this fix in!
Firefox 49 will be released in Sept correct?
Is there a possibility of getting this in an earlier release?

Thx,
NK
Not really - the patch is non-trivial and changes important data-handling sequences, and there's no sec issue or crashing.

Again (and I realize I'm repeating myself) if I knew exactly what the problem you're seeing is and how it affects your application, I might be able to tweak the existing implementation and uplift a tweak (adjust the timing logic, etc).
Blocks: 1280041
You need to log in before you can comment on or make changes to this bug.