Closed
Bug 976115
Opened 11 years ago
Closed 11 years ago
Higher latency on WebRTC DataChannels than Chrome
Categories
(Core :: WebRTC, defect)
Tracking
()
RESOLVED
FIXED
mozilla30
People
(Reporter: ashley, Assigned: jesup)
References
()
Details
(Whiteboard: [bugday-20140303])
Attachments
(6 files)
115.83 KB,
text/plain
|
Details | |
3.16 MB,
text/plain
|
Details | |
1.44 MB,
text/html
|
Details | |
1.54 MB,
text/html
|
Details | |
917.19 KB,
text/html
|
Details | |
1.41 KB,
patch
|
tuexen
:
review+
bkerensa
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
User Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36
Steps to reproduce:
1. Visit a WebRTC DataChannel powered game here: http://www.scirra.com/labs/multiplayer/test1/
2. Open two browser windows.
3. Pick an empty location (0 peers) and join it in both browsers.
4. One browser window should become host (will say 'Host' in top-right). In the other browser window (non-host), observe the 'host ping' measurement.
Actual results:
Firefox 27 & 30 often measure a latency of 40-100ms, even though WebRTC should be making a local zero-latency connection. As a result, the motion in the game is sometimes choppy or laggy.
Expected results:
Run the same example in Chrome and the host latency is 0-2ms. The game is never laggy as the players move around.
Perhaps Firefox has some kind of buffering or blocking that increases latency. However for a multiplayer game adding more than one frame's worth of latency (16ms) is unacceptable.
Updated•11 years ago
|
Component: Untriaged → WebRTC
Product: Firefox → Core
Whiteboard: [bugday-20140303]
Comment 1•11 years ago
|
||
Confirmed 70 ms latency with 2 nightly 30.0a1 (2014-03-04) windows, win 7 x64, while there was 1 ms latency with 2 Google Chrome windows.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee | ||
Comment 2•11 years ago
|
||
70ms latency sounds like relay via a TURN server, while chrome is going direct over the LAN.
We don't implement our own TURN, unless the public Google or Mozilla operated STUN servers we use also provide TURN service (although I thought this would have been too expensive for bandwidth to run publicly). Even if it's using TURN, Firefox should be able to connect locally.
Assignee | ||
Comment 4•11 years ago
|
||
We could use about:webrtc data taken while this test is active, including ICE logs, and also it would be helpful to capture NSPR_LOG_MODULES debugs for datachannel:5,sctp:5,signaling:5,timestamp, and a wireshark trace.
Flags: needinfo?(paul.silaghi)
Assignee | ||
Comment 5•11 years ago
|
||
I've run some of my own traces.
Ashley:
What should I be looking for in the logs that would correspond to the numbers you're putting up? i.e. what is it measuring? One side (the first) is always showing 0ms, the other shows larger numbers (15-45 typically - note one of my browsers is a debug-no-opt build logging the hell out of things). I'm guessing the measurement is on the time between when the 2nd to join moves the mouse (and a "move.turn" message is sent on DataChannels) and when it receives at answer ("you moved/turned").
A quick look with maximum logging on shows maybe 0.1ms between data appearing out of ICE to SCTP and when we send ON_DATA to the DOM layer to deliver, but (i this case) 24ms before the I get a send() in response (if I read it correctly). Then it takes ~0.1ms to get from the DOM send() to passing it down into the DTLS transport layer.
What's happening in the JS layer? can you turn off all processing in that layer to verify? Perhaps something is synchronous and is async on chrome.... (gfx change, changing a dom element/image rotation, etc).
Flags: needinfo?(ashley)
Assignee | ||
Comment 6•11 years ago
|
||
NSPR_LOG_MODULES=datachannel:5,sctp:5,signaling:5,mtransport:9,timestamp NSPR_LOG_FILE=nspr.out is more useful, and R_LOG_LEVEL=9 R_LOG_DESTINATION=stderr (or whatever) R_LOG_VERBOSE=1
Without more understanding of what's being measured it's hard to look further even with logs
Comment 7•11 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #4)
> We could use about:webrtc data taken while this test is active, including
> ICE logs, and also it would be helpful to capture NSPR_LOG_MODULES debugs
> for datachannel:5,sctp:5,signaling:5,timestamp, and a wireshark trace.
I tried this, and found that about:webrtc doesn't show stats for DataChannels. Working on fixing that now (bug 979471).
Comment 8•11 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #4)
> We could use about:webrtc data taken while this test is active
Flags: needinfo?(paul.silaghi)
Comment 9•11 years ago
|
||
> NSPR_LOG_MODULES debugs for datachannel:5,sctp:5,signaling:5,timestamp
Comment 10•11 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #4)
> including ICE logs
how can I get these ?
> and a wireshark trace
I'm not sure what filters should be applied in order to get what we're interested in
Reporter | ||
Comment 11•11 years ago
|
||
It's measuring a simple ping time. One side is the 'host' and runs the authoritative version of the game, so never makes any ping measurements and will show 0 for latency. The other side is making straightforward ping measurements over DataChannels, as in:
1. Send the string "ping" to the host and note the time sent.
2. If the host receives "ping", respond "pong" immediately.
3. Upon receiving "pong", note the time since sending "ping" and divide by two to get the latency. This is the value displayed.
So the round-trip time is actually double the latency. In both the host and the peer, pings and pongs are dealt with first in the message handlers, so it should not be the case that JS execution time contributes a difference. (This is deliberate, since the ping needs to be measured based on network time, not processing time.) So it would appear that the transmission time over DataChannels between two local browsers is unusually high.
Flags: needinfo?(ashley)
Comment 12•11 years ago
|
||
(In reply to Paul Silaghi, QA [:pauly] from comment #10)
> (In reply to Randell Jesup [:jesup] from comment #4)
> > including ICE logs
> how can I get these ?
>
You can get them from about:webrtc (there is a button that displays all logging). Right now, the session has to be active to get this logging, but there is a patch landing soon that will remove this limitation.
> > and a wireshark trace
> I'm not sure what filters should be applied in order to get what we're
> interested in
Really, right now all you can do is filter for UDP, and find the port being used in the ICE logging (grep for a candidate pair being marked as "SUCCEEDED"). The traffic will be encrypted, so there will not be an awful lot to look at. Once bug 979471 lands, you'll be able to see what port is in use on about:webrtc without needing to dig through logging.
Assignee | ||
Comment 13•11 years ago
|
||
I took a close look. The delays are coming in NS_DispatchToMainThread() of the messages (since they must be delivered on MainThread to JS). This is very likely due to the existing code clogging MainThread causing it to be unresponsive; I'm told Chrome doesn't run GL code on MainThread while we do. Also, it's simply taking a lot of hits in the code for Layer.prototype.drawGL() in c2runtime.js:17687 (30%, 20% of them directly there and now things called by it) and Runtime.prototype.logic() in c2runtime.js:3806 (25%)
It's also possible it's causing excessive repainting for some reason that Chrome suppresses.
I'm certain it's not DataChannels. Please try the ping test with all your game and GL logic turned off. If that makes the delay go away, we can investigate with the gfx/games/etc people what's causing the mainthread to be slow.
Thanks!
Assignee | ||
Updated•11 years ago
|
Flags: needinfo?(ashley)
Reporter | ||
Comment 14•11 years ago
|
||
Turning off rendering didn't seem to help much, but it was still running logic (it wouldn't do anything without that). Your suggestion it's to do with the main thread instead of DataChannels does make sense, but I don't think that changes the impact of the bug. Ping time is more important in real-world games than in simple tests. The game is rAF driven so should be hitting 60 FPS, and I don't understand why even if the main thread was active at less than 100% utilisation it would ever add more than 16ms of latency to the ping time. Surely after the next rAF it would fire any queued events? Having an extra latency of 100ms suggests it sat there for a whole six frames before firing onmessage or dispatching a send(). Why couldn't it have done that sooner?
Flags: needinfo?(ashley)
Comment 15•11 years ago
|
||
Any chance for a minimal testcase showing the issue?
Also, does Chrome end up opening two processes for the browser windows, but Firefox just has one?
That could have some major effect here. Does the behavior change if one opens two FF processes
(different profiles)?
Assignee | ||
Comment 16•11 years ago
|
||
Note: profiling in "JP_REALTIME" mode.
Comment 17•11 years ago
|
||
(In reply to Olli Pettay [:smaug] from comment #15)
> Also, does Chrome end up opening two processes for the browser windows, but
> Firefox just has one?
Yes
> Does the behavior change if one opens two FF processes (different profiles)?
No
Assignee | ||
Comment 18•11 years ago
|
||
non-realtime - i.e. every 2ms of CPU time used, and also will catch all threads, not just mainthread.
Assignee | ||
Comment 19•11 years ago
|
||
same non-realtime profile with PresShell::Paint() suppressed
./jprof -e 'PresShell::Paint(nsView*, nsRegion const&, unsigned int)' firefox jprof-log > /tmp/gl-no-paint.html
Comment 20•11 years ago
|
||
reporter, is the latency higher than in Chrome if you run two instances Firefox and don't do any WebGL stuff?
Since based on those profiles it seems to be canvas/painting stuff which is slow, and
running two windows in the same process makes it all worse.
Minimal testcase would be useful.
Comment 21•11 years ago
|
||
Hello, I'm also making a webrtc multiplayer game and I independently reached the same problem as this person.
I'm confident that the problem is not in his game logic or webgl. My first webrtc test involved just connecting two peers and sending ping messages at a rapid rate (No graphics no extra logic). And even in that test Chrome behaved normally while Firefox had a very high extra latency.
Unfortunately my tests use a signaling server which I cannot publicly host at the moment, so I cannot share them.
Comment 22•11 years ago
|
||
Again, is the latency as bad if you use two different instances of Firefox, not just two windows or tabs?
Some kind of minimal testcase here would be really useful.
Comment 23•11 years ago
|
||
The extra latency happens when connecting between different pcs, instances or tabs.
It really feels as if firefox were using nagle's algorithm for sctp.
Here's a minimal-ish testcase I made with peerJS as signaling server:
http://jsfiddle.net/BuL8s/15/
Run the script with host=true to host a ping server. The script will output the server id.
To connect to that server modify the script so that isHost is false and peerId is set to the previously output'd server id.
Hope it helps.
Assignee | ||
Comment 24•11 years ago
|
||
Can you try the DataChannel latency test at http://mozilla.github.com/webrtc-landing ?
That's within two PC's in the same tab.
Assignee | ||
Comment 25•11 years ago
|
||
(In reply to Mario Carbajal from comment #23)
> The extra latency happens when connecting between different pcs, instances
> or tabs.
> It really feels as if firefox were using nagle's algorithm for sctp.
Aha. Last I recall, it may be. We were discussing turning that off. Also, we just (as in this week) updated the SCTP library, which might affect it. Please try with a nightly from today.
>
> Here's a minimal-ish testcase I made with peerJS as signaling server:
>
> http://jsfiddle.net/BuL8s/15/
> Run the script with host=true to host a ping server. The script will output
> the server id.
I get no output at all....
>
> To connect to that server modify the script so that isHost is false and
> peerId is set to the previously output'd server id.
>
> Hope it helps.
Flags: needinfo?(useruser)
Comment 26•11 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #25)
> Aha. Last I recall, it may be. We were discussing turning that off.
This has to be it. Now that I think about it, if you send only 1 ping message per second the latency will be fine. It goes to hell when you increase the rate to about 20 per second, which definitely looks like what nagle would do.
With nagle's algorithm on WebRTC is virtually useless for latency dependent applications like realtime games. I'm unsure if such applications are what webrtc is intended for. It would be nice if the webrtc spec included some way to tell the browser if nagle's algorithm should be allowed for each datachannel.
> Also, we just (as in this week) updated the SCTP library, which might affect it.
> Please try with a nightly from today.
Tried Nightly and it's the same.
> I get no output at all....
Ooops, sorry I had made a last second change that broke it, try this one:
http://jsfiddle.net/BuL8s/21/
Flags: needinfo?(useruser)
Assignee | ||
Comment 27•11 years ago
|
||
Nagle is the issue. When I cranked the rate up to 20 pings/sec, latency went way up (100+). When I added nagle, it sat below 1ms most of the time
Assignee | ||
Updated•11 years ago
|
Attachment #8390813 -
Flags: review?(tuexen)
Assignee | ||
Comment 28•11 years ago
|
||
We should uplift this to 29
Whiteboard: [bugday-20140303] → [bugday-20140303][webrtc-uplift]
Updated•11 years ago
|
Attachment #8390813 -
Flags: review?(tuexen) → review+
Assignee | ||
Comment 29•11 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/8eb7a2a65dfe
Can be tested here:
http://mozilla.github.com/webrtc-landing/data_latency_test.html
status-b2g-v1.2:
--- → unaffected
status-b2g-v1.3:
--- → affected
status-b2g-v1.3T:
--- → affected
status-b2g-v1.4:
--- → affected
status-firefox27:
--- → wontfix
status-firefox28:
--- → wontfix
status-firefox29:
--- → affected
status-firefox30:
--- → affected
status-firefox-esr24:
--- → wontfix
OS: Windows 8.1 → All
Hardware: x86_64 → All
Target Milestone: --- → mozilla30
Assignee | ||
Comment 30•11 years ago
|
||
Comment on attachment 8390813 [details] [diff] [review]
disable Nagle algorithm for DataChannels/sctp
[Approval Request Comment]
Bug caused by (feature/regressing bug #): N/A
User impact if declined: Games using WebRTC negatively affected (considerable lag); could hurt adoption of the feature. Chrome does not have this problem.
Testing completed (on m-c, etc.): on m-i, testcase online (see URL).
Risk to taking this patch (and alternatives if risky): Virtually no risk: trivial patch; SCTP_NODELAY is used in Chrome, recently updated SCTP library so we're not behind on fixes, never any known issues with SCTP_NODELAY upstream.
String or IDL/UUID changes made by this patch: none
Attachment #8390813 -
Flags: approval-mozilla-aurora?
Updated•11 years ago
|
Attachment #8390813 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Assignee | ||
Comment 31•11 years ago
|
||
Assignee: nobody → rjesup
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 32•11 years ago
|
||
Whiteboard: [bugday-20140303][webrtc-uplift] → [bugday-20140303]
Comment 33•11 years ago
|
||
Whiteboard: [bugday-20140303] → [bugday-20140303][webrtc-uplift]
Updated•11 years ago
|
Comment 34•11 years ago
|
||
I'm still seeing a 40-50 ms latency with 2 nightlies 30.0a1 (2014-03-16) win 7 x64 using the first testcase in comment 0.
Thoughts?
Flags: needinfo?(rjesup)
Comment 35•11 years ago
|
||
Hi Paul,
I did some testing with Nightly (2014-03-16) on Mac OS X, two tabs in the same application, and two browsers, each on a separate machines connected via WiFi. In both cases was the RTT less than 5ms...
Same when bringing Chrome into the game.
However, two Nightlies (2014-03-16) on a Windows VM show a larger RTT, also there is substantial CPU load.
Not sure why. Running the test in two tabs however, seems better. Do you also see a CPU load?
Best regards
Michael
Comment 36•11 years ago
|
||
(In reply to Michael Tüxen from comment #35)
> Running the test in two tabs however, seems better. Do you also see a CPU load?
Hi Michael,
I see ~ the same performance, both on 2 tabs/2 windows - ping 30-50 ms, CPU 20-40%
Assignee | ||
Comment 37•11 years ago
|
||
Can you try using
http://mozilla.github.com/webrtc-landing/data_latency_test.html?
That removes all the OpenGL and other processing that the original report had (though it's all in one browser). That clearly shows a major improvement before and after this patch. However, I had also tracked down that this demo (comment 0) has significant latency from when we DispatchToMainThread() and when it gets processed on MainThread, and the nagle changes won't help with that.
This nagle change should have solved the problem from comment 23 (though I never got that demo to run).
If it's a MainThread latency issue when running OpenGL... that isn't a DataChannel problem. We could test independently for that by watching MainThread "jank" (hate that term) during the demo.
Taras: what's the current best method to monitor jank? Can you try the demo from comment 0?
Flags: needinfo?(rjesup) → needinfo?(taras.mozilla)
Assignee | ||
Comment 38•11 years ago
|
||
(In reply to Michael Tüxen from comment #35)
> Hi Paul,
>
> I did some testing with Nightly (2014-03-16) on Mac OS X, two tabs in the
> same application, and two browsers, each on a separate machines connected
> via WiFi. In both cases was the RTT less than 5ms...
> Same when bringing Chrome into the game.
> However, two Nightlies (2014-03-16) on a Windows VM show a larger RTT, also
> there is substantial CPU load.
> Not sure why. Running the test in two tabs however, seems better. Do you
> also see a CPU load?
In a Windows VM: the demo uses WebGL; using that in a VM could lead to considerable CPU use (and latency). I therefore wouldn't trust the comment 0 demo in a VM for performance monitoring.
Also since WebGL in FF does run on Mainthread, that can cause MainThread delay depending on the OpenGL support and driver available, which may also be different between Mac and Windows. Chrome IIRC runs WebGL off their MainThread, so may not hit the same 'jank' issues (if that's what's happening).
>
> Best regards
> Michael
Comment 39•11 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #37)
> Can you try using
> http://mozilla.github.com/webrtc-landing/data_latency_test.html?
> That removes all the OpenGL and other processing that the original report
> had (though it's all in one browser). That clearly shows a major
> improvement before and after this patch.
30s ping test, Win 7 x64:
Before the patch:
Ping:
avg=
0.8001941377635486
last=
0.7460107450679061
min=
0.603231898679951
max=
1.8065538632326934
After the patch:
Ping:
avg=
0.7276558270989785
last=
0.7316979511742829
min=
0.5934573077283858
max=
1.2595258625206043
Comment 40•11 years ago
|
||
(In reply to Paul Silaghi, QA [:pauly] from comment #39)
Keep in mind that nagle's algorithm wont affect the ping test if the frequency of the messages is low, I suggest testing it with an interval of 50 msec between each ping.
Assignee | ||
Comment 41•11 years ago
|
||
Yes, 50ms is a good target to show the improvement. The default (1000ms) doesn't show it.
Those numbers are all reasonable.
Assignee | ||
Updated•11 years ago
|
Whiteboard: [bugday-20140303][webrtc-uplift] → [bugday-20140303]
Comment 43•11 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #41)
> Yes, 50ms is a good target to show the improvement. The default (1000ms)
> doesn't show it.
Ok, I confirm the improvements:
Before:
Ping:
avg=148.64360366477717
last=149.9825454342281
min=0.8165245865020552
max=185.22610644492306
After:
Ping:
avg=1.0711079934263623
last=0.8968155890106573
min=0.6905026216754777
max=9.291763539506064
What's to be done about the jank issue ? Probably a follow up bug would be a good start.
Assignee | ||
Comment 44•11 years ago
|
||
yes, I think this is due for a followup bug to resolve the remaining issue.
vladan: see comment 37 about looking into the MainThread 'jank' here - this has an impact on games in Firefox
Flags: needinfo?(vdjeric)
Comment 45•10 years ago
|
||
Sorry about the response delay. The Gecko Profiler is the best choice for examining how the main thread is spending its time https://github.com/bgirard/Gecko-Profiler-Addon/blob/master/geckoprofiler.xpi
Flags: needinfo?(vdjeric)
You need to log in
before you can comment on or make changes to this bug.
Description
•