Closed Bug 976115 Opened 6 years ago Closed 6 years ago

Higher latency on WebRTC DataChannels than Chrome

Categories

(Core :: WebRTC, defect)

30 Branch
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla30
Tracking Status
firefox27 --- wontfix
firefox28 --- wontfix
firefox29 --- fixed
firefox30 --- fixed
firefox-esr24 --- wontfix
b2g-v1.2 --- unaffected
b2g-v1.3 --- affected
b2g-v1.3T --- affected
b2g-v1.4 --- fixed

People

(Reporter: ashley, Assigned: jesup)

References

()

Details

(Whiteboard: [bugday-20140303])

Attachments

(6 files)

User Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36

Steps to reproduce:

1. Visit a WebRTC DataChannel powered game here: http://www.scirra.com/labs/multiplayer/test1/
2. Open two browser windows.
3. Pick an empty location (0 peers) and join it in both browsers.
4. One browser window should become host (will say 'Host' in top-right). In the other browser window (non-host), observe the 'host ping' measurement.


Actual results:

Firefox 27 & 30 often measure a latency of 40-100ms, even though WebRTC should be making a local zero-latency connection. As a result, the motion in the game is sometimes choppy or laggy.


Expected results:

Run the same example in Chrome and the host latency is 0-2ms. The game is never laggy as the players move around.

Perhaps Firefox has some kind of buffering or blocking that increases latency. However for a multiplayer game adding more than one frame's worth of latency (16ms) is unacceptable.
Component: Untriaged → WebRTC
Product: Firefox → Core
Whiteboard: [bugday-20140303]
Confirmed 70 ms latency with 2 nightly 30.0a1 (2014-03-04) windows, win 7 x64, while there was 1 ms latency with 2 Google Chrome windows.
Status: UNCONFIRMED → NEW
Ever confirmed: true
70ms latency sounds like relay via a TURN server, while chrome is going direct over the LAN.
We don't implement our own TURN, unless the public Google or Mozilla operated STUN servers we use also provide TURN service (although I thought this would have been too expensive for bandwidth to run publicly). Even if it's using TURN, Firefox should be able to connect locally.
We could use about:webrtc data taken while this test is active, including ICE logs, and also it would be helpful to capture NSPR_LOG_MODULES debugs for datachannel:5,sctp:5,signaling:5,timestamp, and a wireshark trace.
Flags: needinfo?(paul.silaghi)
I've run some of my own traces.

Ashley:
What should I be looking for in the logs that would correspond to the numbers you're putting up?  i.e. what is it measuring?  One side (the first) is always showing 0ms, the other shows larger numbers (15-45 typically - note one of my browsers is a debug-no-opt build logging the hell out of things).  I'm guessing the measurement is on the time between when the 2nd to join moves the mouse (and a "move.turn" message is sent on DataChannels) and when it receives at answer ("you moved/turned").

A quick look with maximum logging on shows maybe 0.1ms between data appearing out of ICE to SCTP and when we send ON_DATA to the DOM layer to deliver, but (i this case) 24ms before the I get a send() in response (if I read it correctly).  Then it takes ~0.1ms to get from the DOM send() to passing it down into the DTLS transport layer.

What's happening in the JS layer?  can you turn off all processing in that layer to verify?  Perhaps something is synchronous and is async on chrome....  (gfx change, changing a dom element/image rotation, etc).
Flags: needinfo?(ashley)
NSPR_LOG_MODULES=datachannel:5,sctp:5,signaling:5,mtransport:9,timestamp NSPR_LOG_FILE=nspr.out is more useful, and R_LOG_LEVEL=9 R_LOG_DESTINATION=stderr (or whatever) R_LOG_VERBOSE=1

Without more understanding of what's being measured it's hard to look further even with logs
(In reply to Randell Jesup [:jesup] from comment #4)
> We could use about:webrtc data taken while this test is active, including
> ICE logs, and also it would be helpful to capture NSPR_LOG_MODULES debugs
> for datachannel:5,sctp:5,signaling:5,timestamp, and a wireshark trace.

    I tried this, and found that about:webrtc doesn't show stats for DataChannels. Working on fixing that now (bug 979471).
Attached file about webrtc.txt
(In reply to Randell Jesup [:jesup] from comment #4)
> We could use about:webrtc data taken while this test is active
Flags: needinfo?(paul.silaghi)
Attached file NSPR_LOG_MODULES
> NSPR_LOG_MODULES debugs for datachannel:5,sctp:5,signaling:5,timestamp
(In reply to Randell Jesup [:jesup] from comment #4)
> including ICE logs
how can I get these ?

> and a wireshark trace
I'm not sure what filters should be applied in order to get what we're interested in
It's measuring a simple ping time. One side is the 'host' and runs the authoritative version of the game, so never makes any ping measurements and will show 0 for latency. The other side is making straightforward ping measurements over DataChannels, as in:

1. Send the string "ping" to the host and note the time sent.
2. If the host receives "ping", respond "pong" immediately.
3. Upon receiving "pong", note the time since sending "ping" and divide by two to get the latency. This is the value displayed.

So the round-trip time is actually double the latency. In both the host and the peer, pings and pongs are dealt with first in the message handlers, so it should not be the case that JS execution time contributes a difference. (This is deliberate, since the ping needs to be measured based on network time, not processing time.) So it would appear that the transmission time over DataChannels between two local browsers is unusually high.
Flags: needinfo?(ashley)
(In reply to Paul Silaghi, QA [:pauly] from comment #10)
> (In reply to Randell Jesup [:jesup] from comment #4)
> > including ICE logs
> how can I get these ?
> 

    You can get them from about:webrtc (there is a button that displays all logging). Right now, the session has to be active to get this logging, but there is a patch landing soon that will remove this limitation.

> > and a wireshark trace
> I'm not sure what filters should be applied in order to get what we're
> interested in

    Really, right now all you can do is filter for UDP, and find the port being used in the ICE logging (grep for a candidate pair being marked as "SUCCEEDED"). The traffic will be encrypted, so there will not be an awful lot to look at. Once bug 979471 lands, you'll be able to see what port is in use on about:webrtc without needing to dig through logging.
I took a close look.  The delays are coming in NS_DispatchToMainThread() of the messages (since they must be delivered on MainThread to JS).  This is very likely due to the existing code clogging MainThread causing it to be unresponsive; I'm told Chrome doesn't run GL code on MainThread while we do.  Also, it's simply taking a lot of hits in the code for Layer.prototype.drawGL() in c2runtime.js:17687 (30%, 20% of them directly there and now things called by it) and Runtime.prototype.logic() in c2runtime.js:3806 (25%)

It's also possible it's causing excessive repainting for some reason that Chrome suppresses.  

I'm certain it's not DataChannels.  Please try the ping test with all your game and GL logic turned off.  If that makes the delay go away, we can investigate with the gfx/games/etc people what's causing the mainthread to be slow.

Thanks!
Flags: needinfo?(ashley)
Turning off rendering didn't seem to help much, but it was still running logic (it wouldn't do anything without that). Your suggestion it's to do with the main thread instead of DataChannels does make sense, but I don't think that changes the impact of the bug. Ping time is more important in real-world games than in simple tests. The game is rAF driven so should be hitting 60 FPS, and I don't understand why even if the main thread was active at less than 100% utilisation it would ever add more than 16ms of latency to the ping time. Surely after the next rAF it would fire any queued events? Having an extra latency of 100ms suggests it sat there for a whole six frames before firing onmessage or dispatching a send(). Why couldn't it have done that sooner?
Flags: needinfo?(ashley)
Any chance for a minimal testcase showing the issue?

Also, does Chrome end up opening two processes for the browser windows, but Firefox just has one?
That could have some major effect here. Does the behavior change if one opens two FF processes
(different profiles)?
Note: profiling in "JP_REALTIME" mode.
(In reply to Olli Pettay [:smaug] from comment #15)
> Also, does Chrome end up opening two processes for the browser windows, but
> Firefox just has one?
Yes
> Does the behavior change if one opens two FF processes (different profiles)?
No
Attached file non-realtime profile
non-realtime - i.e. every 2ms of CPU time used, and also will catch all threads, not just mainthread.
Attached file gl-no-paint.html
same non-realtime profile with PresShell::Paint() suppressed 

./jprof -e 'PresShell::Paint(nsView*, nsRegion const&, unsigned int)' firefox jprof-log > /tmp/gl-no-paint.html
reporter, is the latency higher than in Chrome if you run two instances Firefox and don't do any WebGL stuff?
Since based on those profiles it seems to be canvas/painting stuff which is slow, and
running two windows in the same process makes it all worse.

Minimal testcase would be useful.
Hello, I'm also making a webrtc multiplayer game and I independently reached the same problem as this person.

I'm confident that the problem is not in his game logic or webgl. My first webrtc test involved just connecting two peers and sending ping messages at a rapid rate (No graphics no extra logic). And even in that test Chrome behaved normally while Firefox had a very high extra latency.

Unfortunately my tests use a signaling server which I cannot publicly host at the moment, so I cannot share them.
Again, is the latency as bad if you use two different instances of Firefox, not just two windows or tabs?

Some kind of minimal testcase here would be really useful.
The extra latency happens when connecting between different pcs, instances or tabs.
It really feels as if firefox were using nagle's algorithm for sctp.

Here's a minimal-ish testcase I made with peerJS as signaling server:

http://jsfiddle.net/BuL8s/15/
Run the script with host=true to host a ping server. The script will output the server id.

To connect to that server modify the script so that isHost is false and peerId is set to the previously output'd server id.

Hope it helps.
Can you try the DataChannel latency test at http://mozilla.github.com/webrtc-landing ?

That's within two PC's in the same tab.
(In reply to Mario Carbajal from comment #23)
> The extra latency happens when connecting between different pcs, instances
> or tabs.
> It really feels as if firefox were using nagle's algorithm for sctp.

Aha.  Last I recall, it may be.  We were discussing turning that off.  Also, we just (as in this week) updated the SCTP library, which might affect it.  Please try with a nightly from today.

> 
> Here's a minimal-ish testcase I made with peerJS as signaling server:
> 
> http://jsfiddle.net/BuL8s/15/
> Run the script with host=true to host a ping server. The script will output
> the server id.

I get no output at all....

> 
> To connect to that server modify the script so that isHost is false and
> peerId is set to the previously output'd server id.
> 
> Hope it helps.
Flags: needinfo?(useruser)
(In reply to Randell Jesup [:jesup] from comment #25)
> Aha.  Last I recall, it may be.  We were discussing turning that off.

This has to be it. Now that I think about it, if you send only 1 ping message per second the latency will be fine. It goes to hell when you increase the rate to about 20 per second, which definitely looks like what nagle would do.
With nagle's algorithm on WebRTC is virtually useless for latency dependent applications like realtime games. I'm unsure if such applications are what webrtc is intended for. It would be nice if the webrtc spec included some way to tell the browser if nagle's algorithm should be allowed for each datachannel.

> Also, we just (as in this week) updated the SCTP library, which might affect it.
> Please try with a nightly from today.

Tried Nightly and it's the same.

> I get no output at all....

Ooops, sorry I had made a last second change that broke it, try this one:
http://jsfiddle.net/BuL8s/21/
Flags: needinfo?(useruser)
Nagle is the issue.  When I cranked the rate up to 20 pings/sec, latency went way up (100+).  When I added nagle, it sat below 1ms most of the time
Attachment #8390813 - Flags: review?(tuexen)
We should uplift this to 29
Whiteboard: [bugday-20140303] → [bugday-20140303][webrtc-uplift]
Attachment #8390813 - Flags: review?(tuexen) → review+
Comment on attachment 8390813 [details] [diff] [review]
disable Nagle algorithm for DataChannels/sctp

[Approval Request Comment]
Bug caused by (feature/regressing bug #): N/A

User impact if declined: Games using WebRTC negatively affected (considerable lag); could hurt adoption of the feature.  Chrome does not have this problem.

Testing completed (on m-c, etc.): on m-i, testcase online (see URL).

Risk to taking this patch (and alternatives if risky): Virtually no risk: trivial patch; SCTP_NODELAY is used in Chrome, recently updated SCTP library so we're not behind on fixes, never any known issues with SCTP_NODELAY upstream.

String or IDL/UUID changes made by this patch: none
Attachment #8390813 - Flags: approval-mozilla-aurora?
Attachment #8390813 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/mozilla-central/rev/8eb7a2a65dfe
Assignee: nobody → rjesup
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
https://hg.mozilla.org/releases/mozilla-aurora/rev/9c91a16318ce
Whiteboard: [bugday-20140303][webrtc-uplift] → [bugday-20140303]
https://hg.mozilla.org/mozilla-central/rev/8eb7a2a65dfe
Whiteboard: [bugday-20140303] → [bugday-20140303][webrtc-uplift]
I'm still seeing a 40-50 ms latency with 2 nightlies 30.0a1 (2014-03-16) win 7 x64 using the first testcase in comment 0.
Thoughts?
Flags: needinfo?(rjesup)
Hi Paul,

I did some testing with Nightly (2014-03-16) on Mac OS X, two tabs in the same application, and two browsers, each on a separate machines connected via WiFi. In both cases was the RTT less than 5ms...
Same when bringing Chrome into the game.
However, two Nightlies (2014-03-16) on a Windows VM show a larger RTT, also there is substantial CPU load.
Not sure why. Running the test in two tabs however, seems better. Do you also see a CPU load?

Best regards
Michael
(In reply to Michael Tüxen from comment #35)
> Running the test in two tabs however, seems better. Do you also see a CPU load?
Hi Michael,
I see ~ the same performance, both on 2 tabs/2 windows - ping 30-50 ms, CPU 20-40%
Can you try using 
http://mozilla.github.com/webrtc-landing/data_latency_test.html?

That removes all the OpenGL and other processing that the original report had (though it's all in one browser).  That clearly shows a major improvement before and after this patch.  However, I had also tracked down that this demo (comment 0) has significant latency from when we DispatchToMainThread() and when it gets processed on MainThread, and the nagle changes won't help with that.

This nagle change should have solved the problem from comment 23 (though I never got that demo to run).

If it's a MainThread latency issue when running OpenGL... that isn't a DataChannel problem.  We could test independently for that by watching MainThread "jank" (hate that term) during the demo.

Taras: what's the current best method to monitor jank?   Can you try the demo from comment 0?
Flags: needinfo?(rjesup) → needinfo?(taras.mozilla)
(In reply to Michael Tüxen from comment #35)
> Hi Paul,
> 
> I did some testing with Nightly (2014-03-16) on Mac OS X, two tabs in the
> same application, and two browsers, each on a separate machines connected
> via WiFi. In both cases was the RTT less than 5ms...
> Same when bringing Chrome into the game.
> However, two Nightlies (2014-03-16) on a Windows VM show a larger RTT, also
> there is substantial CPU load.
> Not sure why. Running the test in two tabs however, seems better. Do you
> also see a CPU load?

In a Windows VM: the demo uses WebGL; using that in a VM could lead to considerable CPU use (and latency).  I therefore wouldn't trust the comment 0 demo in a VM for performance monitoring.

Also since WebGL in FF does run on Mainthread, that can cause MainThread delay depending on the OpenGL support and driver available, which may also be different between Mac and Windows.  Chrome IIRC runs WebGL off their MainThread, so may not hit the same 'jank' issues (if that's what's happening).

> 
> Best regards
> Michael
(In reply to Randell Jesup [:jesup] from comment #37)
> Can you try using 
> http://mozilla.github.com/webrtc-landing/data_latency_test.html? 
> That removes all the OpenGL and other processing that the original report
> had (though it's all in one browser).  That clearly shows a major
> improvement before and after this patch. 
30s ping test, Win 7 x64:
Before the patch:
Ping:
avg=
0.8001941377635486
last=
0.7460107450679061
min=
0.603231898679951
max=
1.8065538632326934

After the patch:
Ping:
avg=
0.7276558270989785
last=
0.7316979511742829
min=
0.5934573077283858
max=
1.2595258625206043
(In reply to Paul Silaghi, QA [:pauly] from comment #39)

Keep in mind that nagle's algorithm wont affect the ping test if the frequency of the messages is low, I suggest testing it with an interval of 50 msec between each ping.
Yes, 50ms is a good target to show the improvement.  The default (1000ms) doesn't show it.

Those numbers are all reasonable.
Whiteboard: [bugday-20140303][webrtc-uplift] → [bugday-20140303]
Jank falls under Vladan now
Flags: needinfo?(taras.mozilla)
(In reply to Randell Jesup [:jesup] from comment #41)
> Yes, 50ms is a good target to show the improvement.  The default (1000ms)
> doesn't show it.
Ok, I confirm the improvements:
Before:
Ping:
avg=148.64360366477717
last=149.9825454342281
min=0.8165245865020552
max=185.22610644492306

After:
Ping:
avg=1.0711079934263623
last=0.8968155890106573
min=0.6905026216754777
max=9.291763539506064

What's to be done about the jank issue ? Probably a follow up bug would be a good start.
yes, I think this is due for a followup bug to resolve the remaining issue.

vladan: see comment 37 about looking into the MainThread 'jank' here - this has an impact on games in Firefox
Flags: needinfo?(vdjeric)
Sorry about the response delay. The Gecko Profiler is the best choice for examining how the main thread is spending its time https://github.com/bgirard/Gecko-Profiler-Addon/blob/master/geckoprofiler.xpi
Flags: needinfo?(vdjeric)
You need to log in before you can comment on or make changes to this bug.