Closed Bug 842283 Opened 12 years ago Closed 12 years ago

data over reliable data channel drops

Categories

(Core :: WebRTC: Networking, defect, P2)

x86_64
Windows 7
defect

Tracking

()

RESOLVED DUPLICATE of bug 896228

People

(Reporter: shacharz, Assigned: jesup)

Details

(Whiteboard: [webrtc][blocking-webrtc-])

Attachments

(1 file)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.84 Safari/537.22

Steps to reproduce:
Created a test page, sharefest.peer5.com, where a reliable dataChannel is created. peer1 then asks peer2 for chunks of the file it has, and peer2 replies by sending those chunks of data.

Actual results:
Sometimes those chunks of data aren't received.

Expected results:
Chunks of data over a reliable channel should "always" be received. "Always" holds under reasonable conditions, of course: if the connection to the other peer is lost, then there could be packet loss.
Component: Untriaged → WebRTC: Networking
Product: Firefox → Core
QA Contact: jsmith
Version: 21 Branch → Trunk
shachar - can you verify this is seen in a build made after but 837103 was fixed? It would have caused these exact symptoms. It landed in m-c on Feb 6, and would have been in the Feb 7 nightly. Changeset https://hg.mozilla.org/mozilla-central/rev/0383bb82c925
Flags: needinfo?(shacharz)
bug 837103 of course...
Yes, and although it is more likely to happen between 2 computers and over wireless, it sometimes happens on localhost.
Flags: needinfo?(shacharz)
Ok, thanks. Can you try a debug build, and set NSPR_LOG_MODULES=datachannel:5,sctp:5 and NSPR_LOG_FILE=whatever and then attach the output here?
Assignee: nobody → rjesup
Status: UNCONFIRMED → NEW
Ever confirmed: true
Priority: -- → P2
Whiteboard: [webrtc][blocking-webrtc+]
Shachar, can you be a bit more specific how to reproduce the problem? Just try one file? Which size? Anything else? Best regards Michael
Michael: if you're using sharefest, then yes share 1 file (e.g 20MB), open a new tab with the dynamic link created. And notice that in the receiver's console you'll see "expire.." that happens when the chunk doesn't reach its destination after 1 second.
Shachar: I can reproduce the problem you are describing. However, it doesn't seem to be related to unreliable transfer. The Wireshark trace shows no indication that PR-SCTP is being used. What I see is the usage of two datachannels, both transferring strings, not binary data. On one data channel you are transferring small strings, about 400 bytes. After the transmission on the other data channel stops, these messages are sent every 2 seconds. So do you have an app timer running? The other datachannel transfers messages of about 6800 bytes. After some number of messages (2730 in my case), no further messages are sent on that data channel. So something is missing. The SCTP trace looks fine, so I think the problem is not within the SCTP stack. I do see datachannel log entries like: DataChannelOnMessageAvailable (5) with null Listener! I have to look up what this means... And I need to double check the code path for DOMSTRING.
Shachar: BTW: Are you closing data channels? Best regards Michael
(In reply to Michael Tüxen from comment #7)
> Shachar: I can reproduce the problem you are describing. However, it doesn't
> be related to unreliable transfer. The Wireshark trace shows no indication
> that PR-SCTP is being used.
> What I see is the usage of two datachannels, both transferring strings, not
> binary data. On one data channel you are transferring small strings, about
> 400 bytes. After the transmission on the other data channel stops, these
> messages are sent every 2 seconds. So do you have an app timer running? The
> other datachannel transfers messages of size about 6800 bytes. After some
> number of messages (2730 in my case),
> no further messages are sent on that data channel. So something is missing.
> The SCTP trace looks fine, so I think the problem is not within the SCTP
> stack.
>
> I do see datachannel log entries like:
> DataChannelOnMessageAvailable (5) with null Listener!
> I have to look up what this means... And need to double check the code path
> for DOMSTRING.

I guess you're seeing 2 data channels, because you're running both tabs in the same computer, so 1 for each.
1: "the receiver" is sending chunk requests - probably the small messages
2: "the sender" is sending the chunk data - probably the larger ones

the app doesn't exactly run a timer, it requests more chunks once the earlier chunks arrived, and if they don't arrive after a while (1 second currently) there's a timer to "Expire" the chunks

(In reply to Michael Tüxen from comment #8)
> Sharchar: BTW: Are you closing data channels?
>
> Best regards
> Michael

the data channels aren't closed currently

I'm not sure if this is the same bug, but I could reproduce the phenomenon between 2 win7-64bit computers (over wireless) with file size > 50KB in this demo: http://masweb.ics.es.osaka-u.ac.jp/~k-nkgwj/webrtc/test/multihost-datachannel/
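The request-then-expire behavior described in this comment could be sketched roughly as below. This is not sharefest's actual code: the class name `PendingChunks`, its methods, and the injectable clock are hypothetical, and only the 1-second timeout is taken from the comment.

```typescript
// Hypothetical sketch: track outstanding chunk requests and expire stale ones.
// Only the 1-second default timeout comes from the comment above.
class PendingChunks {
  private requestedAt = new Map<number, number>();
  private timeoutMs: number;
  private now: () => number;

  constructor(timeoutMs: number = 1000, now: () => number = () => Date.now()) {
    this.timeoutMs = timeoutMs;
    this.now = now;
  }

  // Record that a request for this chunk was just sent.
  request(chunkId: number): void {
    this.requestedAt.set(chunkId, this.now());
  }

  // Mark a chunk as received; returns false if it was not pending.
  receive(chunkId: number): boolean {
    return this.requestedAt.delete(chunkId);
  }

  // Remove and return the ids of chunks requested at least timeoutMs ago.
  expire(): number[] {
    const cutoff = this.now() - this.timeoutMs;
    const expired: number[] = [];
    this.requestedAt.forEach((sentAt, id) => {
      if (sentAt <= cutoff) expired.push(id);
    });
    expired.forEach((id) => this.requestedAt.delete(id));
    return expired;
  }

  // Number of requests still waiting for their chunk.
  get size(): number {
    return this.requestedAt.size;
  }
}
```

Calling `expire()` from a periodic timer and re-requesting the returned ids would match the "expire.." console messages mentioned earlier; passing the clock in keeps the logic testable without real timers.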
(In reply to Shachar from comment #9)
> (In reply to Michael Tüxen from comment #7)
> > Shachar: I can reproduce the problem you are describing. However, it doesn't
> > be related to unreliable transfer. The Wireshark trace shows no indication
> > that PR-SCTP is being used.
> > What I see is the usage of two datachannels, both transferring strings, not
> > binary data. On one data channel you are transferring small strings, about
> > 400 bytes. After the transmission on the other data channel stops, these
> > messages are sent every 2 seconds. So do you have an app timer running? The
> > other datachannel transfers messages of size about 6800 bytes. After some
> > number of messages (2730 in my case),
> > no further messages are sent on that data channel. So something is missing.
> > The SCTP trace looks fine, so I think the problem is not within the SCTP
> > stack.
> >
> > I do see datachannel log entries like:
> > DataChannelOnMessageAvailable (5) with null Listener!
> > I have to look up what this means... And need to double check the code path
> > for DOMSTRING.
> I guess you're seeing 2 data channels, because you're running both tabs in
> the same computer, so 1 for each.
> 1: "the receiver" is sending chunk requests - probably the small messages
> 2: "the sender" is sending the chunk data - probably the larger ones

Correct. Thanks for the clarification. The tracefile shows this, I can understand this now.

>
> the app doesn't exactly run a timer, it requests more chunks once the
> earlier chunks arrived, and if they don't arrive after a while (1 second
> currently) there's a timer to "Expire" the chunks
>
> (In reply to Michael Tüxen from comment #8)
> > Sharchar: BTW: Are you closing data channels?
> >
> > Best regards
> > Michael
> the data channels aren't closed currently

Great! I know that we have a bug related to closing...

>
> I'm not sure if this is the same bug, but I could reproduce the phenomenon
> between 2 win7-64bit computers (over wireless) with file size > 50KB in this
> demo:
> http://masweb.ics.es.osaka-u.ac.jp/~k-nkgwj/webrtc/test/multihost-datachannel/

Just to double check: If (for whatever reason) the small chunk request wouldn't be received anymore by your application, you wouldn't send the large ones anymore, right? My current guess is that somehow messages received by the SCTP stack aren't delivered to the JS application anymore at some point. Not sure. Just a guess, but this would make sense from looking at the Wireshark tracefile and your explanations. Best regards Michael
(In reply to Michael Tüxen from comment #10)
> Just to double check: If (for whatever reason) the small chunk request
> wouldn't
> be received anymore by your application, you wouldn't send the large ones
> anymore,
> right?

that's correct
OK, the logfile shows: 1961159008[100469660]: DataChannelOnMessageAvailable (5) with null Listener! This means that during the transfer, mChannel->mListener gets NULL. I don't know why. But once that happens, no messages will be delivered to the JS layer and you observe the behavior you are experiencing. Randell: Any idea why the mListener gets NULL?
mListener should be set to NULL only if the DOM object went away and was garbage-collected. Assign the data channel to a var so that a reference to it stays alive. There's an open bug to implement the WebSockets behavior of it not being GC'd if there's still an active listener in it.
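One way to follow that advice from application code is to park every channel in a long-lived collection until it closes. The sketch below is only an illustration under assumptions: `ChannelRegistry` is a hypothetical helper, and `ChannelLike` is a stand-in interface for the browser's data channel object so the snippet is self-contained.

```typescript
// Stand-in for the parts of a data channel this sketch touches (assumption).
interface ChannelLike {
  label: string;
  onclose: (() => void) | null;
}

// Hypothetical helper: holds strong references so that channels which are
// still in use stay reachable from script and are not garbage-collected.
class ChannelRegistry {
  private channels = new Set<ChannelLike>();

  // Keep the channel referenced until its onclose handler runs.
  keep<T extends ChannelLike>(channel: T): T {
    this.channels.add(channel);
    const previous = channel.onclose;
    channel.onclose = () => {
      this.channels.delete(channel);
      if (previous) previous(); // preserve any handler the app already set
    };
    return channel;
  }

  // Number of channels currently held.
  get count(): number {
    return this.channels.size;
  }
}
```

In a page this might be used as `registry.keep(pc.createDataChannel("chunks"))`, where `pc` is the peer connection and `registry` lives for the lifetime of the page; channels that arrive through the peer connection's `ondatachannel` event would be passed through `keep` in the same way.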
Shachar: Can you try what Randell suggested and report if the behavior changes? If it does, could you also retry http://masweb.ics.es.osaka-u.ac.jp/~k-nkgwj/webrtc/test/multihost-datachannel/ Thanks a lot! Best regards Michael
Shachar: Retesting today with sharefest showed no problem in contrast to testing before. Did you change anything on the JS side? Best regards Michael
Yea, I can't reproduce it on localhost anymore either. There are race conditions where more than one dataChannel is being created, so although the other DCs are not used, I saved a pointer to them (maybe that's what solved it?). I can still reproduce the problem between 2 different computers (tried on wireless).

(In reply to Michael Tüxen from comment #15)
> Shachar: Retesting today with sharefest showed no problem in contrast to
> testing before. Did you change anything on the JS side?
>
> Best regards
> Michael
Did some testing between Mac OS X, Windows 7 and Linux Ubuntu with Firefox Nightly. I copied a 100MB file over WLAN from each platform and it worked fine. Any idea what I can do to reproduce it? Best regards Michael
Shachar, can you provide steps to reproduce? Thanks.
Flags: needinfo?(shacharz)
Sorry for taking so long. Try using a 90MB file (there's a 100MB or so limitation on sharefest right now) between 2 computers: win7 64bit to win7 64bit, both over wireless (I can reproduce it with both computers on the same wifi). This scenario leads to data drops on both demo pages mentioned above.

(In reply to Maire Reavy [:mreavy] from comment #18)
> Shachar, can you provide steps to reproduce? Thanks.
Flags: needinfo?(shacharz)
Can you provide the logging as described in comment 4? That allows us to figure out what is going on... Best regards Michael
I just tested transferring a 100MB file over the WLAN at the IETF and it worked fine. However, this was done between two Mac OS X machines, not Windows... Best regards Michael
Attachment of the logs from the following scenario: Using the sharefest.peer5.com application, Sender (win7 64bit, connected by wire) sends a 5MB file to Receiver (win7 64bit, connected wirelessly)
Whiteboard: [webrtc][blocking-webrtc+] → [webrtc][blocking-webrtc-]
I have looked at the traces. At some point in time, both sides do not receive any SCTP packets anymore. It seems like the receive threads are somehow blocked or packets are not received anymore. This happens on both sides! However, there is no indication why this happens. This problem is strange, since I can't reproduce it (I really tried hard); I even used a Windows VM. So there must be something specific to your test setup that I didn't have. No idea what it could be... Do you have NAT boxes between the sender and receiver?
Is there any chance a NAT rebooted, or an IP address changed at one end, or one end lost connectivity, etc?
Don't think so, It's 2 computers connected to the same router, 1 wired and 1 wireless.
Ok, I seem to have narrowed down the problem, and I got it to work now (updated in sharefest.me, so it'll be harder to reproduce there).

Scenario 1 (works):
receiver: requests 1 chunk of data
sender: dc.send() the requested chunk
... and so on until the receiver has the entire file.

Scenario 2 (doesn't work):
receiver: requests 100 chunks of data
sender: dc.send() the requested chunks of data one after the other
... and so on until the receiver has the entire file.

In scenario 2 the connection immediately drops and stops sending (only reproduced when the sender is Windows 7; couldn't reproduce when the sender is OSX). Also, it happens with both reliable and unreliable DCs.
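The two scenarios differ only in how many chunks are outstanding at once: 1 in scenario 1, 100 in scenario 2. A minimal sketch of turning that number into a tunable window on the receiver side follows. `nextRequests` and its parameters are hypothetical names rather than sharefest's code, and a small window would only be a workaround for the drop, not a fix for its cause.

```typescript
// Hypothetical helper: pick which chunk indices to request next so that at
// most `windowSize` requests are outstanding at any time.
function nextRequests(
  totalChunks: number,
  received: Set<number>,
  inFlight: Set<number>,
  windowSize: number
): number[] {
  const toRequest: number[] = [];
  for (let i = 0; i < totalChunks; i++) {
    // Stop once the window is full, counting requests already in flight.
    if (inFlight.size + toRequest.length >= windowSize) break;
    if (!received.has(i) && !inFlight.has(i)) toRequest.push(i);
  }
  return toRequest;
}
```

With `windowSize` set to 1 this reproduces the request pattern of scenario 1, and with 100 that of scenario 2, which makes it easy to test intermediate values when narrowing down where the transfer starts to stall.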
Aha. Please get a log with datachannel:5,sctp:5 !! I suspect a socket-buffer-overflow is aborting the association on windows
I thought that's what I did (in the attachment)

(In reply to Randell Jesup [:jesup] from comment #27)
> Aha. Please get a log with datachannel:5,sctp:5 !!
>
> I suspect a socket-buffer-overflow is aborting the association on windows
I believe we have fixed this now, and the fix will be in FF 23. Please verify if possible. Thanks!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE