Closed Bug 1214269 Opened 9 years ago Closed 8 years ago

WebRTC file transfer, data channel stale

Categories

(Core :: WebRTC: Networking, defect, P2)

40 Branch
defect

Tracking


RESOLVED FIXED
mozilla47
Tracking Status
firefox47 --- fixed

People

(Reporter: xmo, Assigned: drno)

Attachments

(3 files)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36

Steps to reproduce:

I'm developing a service to do file transfer using a WebRTC datachannel towards a Firefox client, using SCTP over DTLS 1.0.


Actual results:

For files smaller than 10 kB, the transfer completes successfully.

For larger files, after a certain number of packets have been sent (around 150, but it varies), the TSN acknowledged by Firefox's SACKs stops increasing (it plateaus at 'tsnA'), while my service keeps retransmitting the next two data packets (i.e. tsnA+1 and tsnA+2) until the timeout is reached.


Expected results:


I suspect the problem may come from the DTLS packets sent by my service.
How can I make Firefox output detailed errors about the invalid packets it receives?

(Attached is a log file with a fair bunch of traces enabled, but no error is visible in there---NSPR_LOG_MODULES=datachannel:9,sctp:9,signaling:9,mtransport:9,jsep:9)

Thanks,
BR
Flags: needinfo?(general)
In tcsh on Linux, you can convert your log into a pcapng file for Wireshark using this alias (usage: log2pcap <logfile> <output.pcapng>):
alias log2pcap	'grep "SCTP_PACKET" !:1 | text2pcap -D -n -l 248 -t "%H:%M:%S." - !:2'

In this case, I see the TSN received by SCTP jump from ...325 to ...238 - a retransmission of a packet from seconds before, which had been SACKed immediately.

I'll attach the pcapng (note, though, that you can't see the direction of the packets).
Component: WebRTC → WebRTC: Networking
Flags: needinfo?(general) → needinfo?(xmo)
Looking at the trace file it seems to me that the node at which the capture was done behaves correctly. Until frame 279 the transfer is perfect. The tracing node SACKs every other packet containing DATA chunks. Then in frame 280 the peer retransmits an already SACKed TSN, and the tracing node reports that correctly in frame 281. After that the peer retransmits DATA chunks which were already SACKed, and the tracing node reports them as duplicates. So the problem is not the SCTP part on the tracing node.
It looks like the peer hasn't processed the SACKs sent in frame 147 and later on (to be clear: NO SACK after frame 147 has been processed by the peer). What is suspicious: there is a 1-second hiccup between frame 147 and frame 148. I bet this is related. So any idea what happens at the peer at that point in time?
Thank you guys,
This enabled me to see that it is mainly a matter of timing on the peer side.

Yes Michael, the 1-second delay corresponds to the peer deciding to start retransmissions, because it basically receives SACKs with too much delay: the 1st transmission of TSN 325 is followed by SACK 237.
So it retransmits from 238...

I'll let you know if I find something worth being shared, thanks again!
/X
Status: UNCONFIRMED → RESOLVED
Closed: 9 years ago
Flags: needinfo?(xmo)
Resolution: --- → INVALID
Hello,
I'm back :)

Further investigation shows that Firefox handles incoming packets rather slowly (>2 ms per packet).
Manually slowing down the emitting pace improves things, but then file transfers take several minutes...

  1. Is there any mechanism or protocol parameter that I may have missed to match the receive pace?

Btw, I don't know the internals, but it's as if Firefox's input packet queue does not pop when no new packet is coming in, even though the queue is not empty.  Hence the hiccup when the peer waits before retransmitting: Firefox does not take advantage of the emitter's pause to catch up.

  2. Am I right?  If so, is this normal behavior?


Thanks for your advice,
/X
Status: RESOLVED → UNCONFIRMED
Flags: needinfo?(general)
Resolution: INVALID → ---
I think jesup (or Michael) is the appropriate person for your concerns.
Flags: needinfo?(general) → needinfo?(rjesup)
I can't comment on the 2 ms delay. This is not happening within the SCTP stack. However, it might be related to the way the packets are handled by the other parts of the code. Randell?

Just to understand what you are experiencing:
What are the two nodes communicating? Is Firefox on both sides? You did provide a trace file from the receiver, which is Firefox, right? Could you also provide such a trace file from the sender? It would be preferable to have the two trace files (one from the sender, one from the receiver) based on the same data transfer.
(In reply to xmo from comment #5)
> Further investigation shows that Firefox handles incoming packets rather
> slowly (>2ms per packet).
> Manually slowing down the emitting pace improves things, yet then file
> transfers take several minutes...

Please provide traces as before for both cases (with and without pacing).  There's no reason I can think of for something like this to occur.  What version of Firefox, on what OS, and if 43 or 44, was e10s enabled?

>   1. Is there any mechanism or protocol parameter that I'd have missed to
> fit the receive pace?
> 
> Btw, I don't know the internals but it's like Firefox input packet queue
> does not pop if no new packet is coming in, though the queue is not empty. 
> Hence the hiccup when the peer waits before retransmitting: Firefox does not
> take advantage of emitter's pause to catch up.

Please indicate why you believe that has occurred - what I see in the original trace are SACKs after each incoming packet.  As Michael asks, can you provide a trace from the other end showing this slow response?
Flags: needinfo?(rjesup) → needinfo?(xmo)
(In reply to Randell Jesup [:jesup] from comment #8)
> Please provide traces as before for both cases (with and without pacing). 
> There's no reason I can think of for something like this to occur.  What
> version of Firefox, on what OS, and if 43 or 44, was e10s enabled?

Version 41.0.2 on Windows.


(In reply to Michael Tüxen from comment #7)
> What are the two nodes communicating? Is Firefox on both sides? You did
> provide a tracefile from the receiver, which is Forefox right? 

Right, and the other end is my media service (which is not a browser).
I'll email you the capture from the service side, you'll clearly see what I mean.

(Btw, adding Custom columns 'sctp.data_tsn' and 'sctp.sack_cumulative_tsn_ack' in Wireshark is a good tip :)


Thanks, have a good day,
/X
Flags: needinfo?(xmo)
NI to come back and confirm or not
Flags: needinfo?(rjesup)
Bug 1214269: read multiple DTLS packets from NSS if present. r?ekr,mt,jesup
Attachment #8682932 - Flags: review?(rjesup)
Attachment #8682932 - Flags: review?(martin.thomson)
Attachment #8682932 - Flags: review?(ekr)
Assignee: nobody → drno
Status: UNCONFIRMED → ASSIGNED
backlog: --- → webrtc/webaudio+
Rank: 25
Ever confirmed: true
Priority: -- → P2
Comment on attachment 8682932 [details]
MozReview Request: Bug 1214269: read multiple DTLS packets from NSS if present. r=mt,jesup

https://reviewboard.mozilla.org/r/24173/#review21661

::: media/mtransport/test/transport_unittests.cpp:808
(Diff revision 1)
> -  size_t received_;
> +  size_t received_ct_;

Nit: would prefer received_packets_ (or received_count_)

::: media/mtransport/test/transport_unittests.cpp:931
(Diff revision 1)
> -    unsigned char buf[1000];
> +    unsigned char buf[bytes];

do we want the default to be > typical MTU?  We need to test "large" packets, but most will be <= ethernet MTU, and I don't see anything forcing a size lower than the default, just larger (8192/8155).

::: media/mtransport/test/transport_unittests.cpp:1196
(Diff revision 1)
> +  TransferTest(1, 8155);

either calculate the max size, or comment as to what the calculation is (8192 - ....)
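
For illustration, the magic number could be made self-documenting along these lines (constant names are invented here, and the 37-byte DTLS overhead is cipher-dependent, as noted in the next comment):

  // Hypothetical named constants instead of the 8155 magic number:
  static const size_t kNrIceBufferSize = 8192;  // nICEr's receive buffer size
  static const size_t kDtlsOverhead = 37;       // observed DTLS record overhead;
                                                // varies with cipher and padding
  TransferTest(1, kNrIceBufferSize - kDtlsOverhead);  // == 8155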

::: media/mtransport/test/transport_unittests.cpp:1202
(Diff revision 1)
> +  TransferTest(3);

do we need both this and TestTransferIce?  It does TransferTest(1), this does TransferTest(3).  I'd get rid of the other one....  but I don't care a lot.

::: media/mtransport/transportlayerdtls.cpp:994
(Diff revision 1)
> +    } while(rv > 0);

trivial nit: space after while
Attachment #8682932 - Flags: review?(rjesup) → review+
https://reviewboard.mozilla.org/r/24173/#review21661

> either calculate the max size, or comment as to what the calculation is (8192 - ....)

As it turns out, the 37 bytes of DTLS overhead are not fixed (they depend on ciphers, padding, etc.) and packets which are bigger than the nICEr buffer size silently disappear, so I decided to lower the buffer to 8000 bytes instead.
Now the tests consistently test for the maximum of 8000 bytes, we will see NSS errors in case more than 8000 bytes are read from the network, and I added comments in the code and the tests to document the reasoning for this.
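
For reference, the shape of the read loop under review is roughly the following (a simplified sketch, not the literal patch; it assumes the DTLS layer's NSPR socket ssl_fd_ and the mtransport SignalPacketReceived upcall):

  // Drain every decrypted DTLS record NSS has buffered, instead of reading
  // only one record per incoming UDP packet.
  unsigned char buf[8000];
  int32_t rv;
  do {
    rv = PR_Recv(ssl_fd_, buf, sizeof(buf), 0, PR_INTERVAL_NO_WAIT);
    if (rv > 0) {
      // Hand each plaintext packet up to the layer above (SCTP).
      SignalPacketReceived(this, buf, rv);
    }
  } while (rv > 0);  // stop once NSS has no complete record left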
Comment on attachment 8682932 [details]
MozReview Request: Bug 1214269: read multiple DTLS packets from NSS if present. r=mt,jesup

Bug 1214269: read multiple DTLS packets from NSS if present. r+jesup
Attachment #8682932 - Attachment description: MozReview Request: Bug 1214269: read multiple DTLS packets from NSS if present. r?ekr,mt,jesup → MozReview Request: Bug 1214269: read multiple DTLS packets from NSS if present. r+jesup
Attachment #8682932 - Flags: review?(martin.thomson)
Attachment #8682932 - Flags: review?(ekr)
To be safe I started another try run after my last edits. Need to check back on the results.
Flags: needinfo?(drno)
Hi X,

it would really help if you could try again with this try build:
http://archive.mozilla.org/pub/firefox/try-builds/bind-autoland@mozilla.com-7b80597c4435019bca10cbc7c7d2fd34bdbccadb/try-win32/firefox-45.0a1.en-US.win32.zip

and let us know if this fixes the problem you see.

In case you are not using Windows you can find all binaries from my try build here: http://archive.mozilla.org/pub/firefox/try-builds/bind-autoland@mozilla.com-7b80597c4435019bca10cbc7c7d2fd34bdbccadb/
Flags: needinfo?(xmo)
Thanks Nils, 

However, this does not seem to fix the behavior, sorry :(

Even worse, in fact: after the various retransmission pauses (1 second, then 2 secs, etc.), no SACK is returned to the sender, as if those retransmissions were ignored---previously, they triggered processing of the next DTLS record...

BR,
/X
Flags: needinfo?(xmo)
(In reply to xmo from comment #18)
> However, this does not seem to fix the behavior, sorry :(
> 
> Even worse, in fact: after the various retransmission pauses (1 second,
> then 2 secs, etc.), no SACK is returned to the sender, as if those
> retransmissions were ignored---previously, they triggered processing of
> the next DTLS record...

Wow, that is surprising and discouraging. :-( 
We were pretty confident that we had found the problem. But maybe we have more problems than just this one. Could you please send us (to jesup, michael and me) the same logs again as last time, but from the test with the Nightly try build?

Thank you very much for helping us debug these problems.
Try results did not show any major problem with the patch.

But I think we should hold off on this until we know more about the remaining problems X experiences here.
Flags: needinfo?(drno)
Hello Nils,
As emailed, note that the "worse" problem appears only with e10s enabled.
My first tests show that disabling e10s makes FFx behave much better than before the fix.  I'll test further and will let you know if need be.

Thanks!
/X
So I dug deeper to understand why I fail to pass 8192 bytes through the ICE layers (even though nICEr uses a buffer of exactly that size):

If the test writes 8192 bytes of data, the DTLS layer will add several bytes to it (the amount varies because of the cipher used, etc.). Eventually NSPR will write something over 8200 bytes onto the network.
On the receiving end nICEr calls PR_RecvFrom with the size (8192) of its buffer. Now the documentation of PR_RecvFrom (https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSPR/Reference/PR_RecvFrom) is a little vague:
 - "amount The size of buf (in bytes)." sounds like a max value
 - "No more than amount bytes will be transferred."
But what I see when I run the test with the theoretical maximum is that data truncation happens. And unfortunately that actually matches the behavior of recvfrom on Linux: http://linux.die.net/man/2/recvfrom
The DTLS layer actually writes log messages about this truncation happening, but you have to have the right log options set to see them :-)

Now on Linux, according to that man page, we would have options to at least learn that the packet got truncated via the MSG_TRUNC flag, or in the best case we could even use the MSG_PEEK flag first to figure out how big the next packet is going to be. But it does not look like NSPR supports either of these two options.
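
For illustration, here is what that flag gives you with plain POSIX recvfrom on Linux (a sketch of what we cannot currently do through PR_RecvFrom; fd is a hypothetical UDP socket):

  #include <sys/socket.h>

  char buf[8192];
  // With MSG_TRUNC, Linux returns the real datagram length even when it
  // exceeds the buffer, so truncation becomes detectable:
  ssize_t n = recvfrom(fd, buf, sizeof(buf), MSG_TRUNC, nullptr, nullptr);
  if (n > (ssize_t)sizeof(buf)) {
    // the datagram was truncated; bytes beyond sizeof(buf) are lost
  }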

I see these options for TransportLayerDtls::PacketReceived() right now:
- have/keep a buffer size slightly above the normal MTU, which is what we have right now with a size of 2000 bytes. All non-standard packets will then get truncated and thrown away.
- increase the buffer size to 8192 (or slightly below?) to match nICEr's internal buffer. That should allow us to handle any DTLS packet with up to 8192 bytes of payload in the UDP packet.
- increase the buffer to something bigger than 8192; e.g. 65,535 would absolutely ensure that no truncation happens (NSPR apparently uses something around 18k internally, from what I saw in my debugger).

Ekr, Jesup: opinions on what to do about the buffer size in TransportLayerDtls::PacketReceived()?
Flags: needinfo?(ekr)
Comment on attachment 8682932 [details]
MozReview Request: Bug 1214269: read multiple DTLS packets from NSS if present. r=mt,jesup

https://reviewboard.mozilla.org/r/24173/#review22145

My only concern is that you haven't allowed enough space to accept what nICEr produces by having a smaller buffer.  TLS will reduce the size, but maybe not by 192.  I'd rather see this accept 8192 if that is the maximum packet size that the lower layer can produce.

::: media/mtransport/transportlayerdtls.cpp:972
(Diff revision 2)
> +    unsigned char buf[8000];

So why have you chosen something smaller than 8192?
Attachment #8682932 - Flags: review+
https://reviewboard.mozilla.org/r/24173/#review22145

> So why have you chosen something smaller than 8192?

Just read the comment above.  If the intent is to have something that fails more often, then you need to say that, and justify it.  I would have expected that you would want to fail less, not more.
https://reviewboard.mozilla.org/r/24173/#review22145

> Just read the comment above.  If the intent is to have something that fails more often, then you need to say that, and justify it.  I would have expected that you would want to fail less, not more.

I'm confused by this too.

The way this works is that DTLS-protection expands stuff, so DTLS-unprotection removes stuff, but by an unpredictable value.

It would be best to write packets of size X, then read buffers of X + 256 or so, both from ICE and from DTLS.
(In reply to Eric Rescorla (:ekr) from comment #25)
> I'm confused by this too.
> 
> The way this works is that DTLS-protection expands stuff, so
> DTLS-unprotection removes stuff, but by an unpredictable value.
> 
> It would be best to write packets of size X, then read buffers of X + 256 or
> so, both from ICE and from DTLS.

Please ignore the 8000 in the patch. I arbitrarily picked that value when I wrote the patch, without knowing what exactly was going on.
After that I did the investigation described in comment #22. I think the three options here are:
1) stick with something slightly above the regular MTU of 1500 bytes - which basically breaks jumbo frames, as the PR_RecvFrom() calls will truncate them and the DTLS layer will drop them.
2) align the buffer in TransportLayerDtls::PacketReceived() with the nICEr one at 8192 bytes. That way frames bigger than our current 2000-byte limit should work. Anything bigger than 8192 bytes would still get truncated by the PR_RecvFrom() call and discarded by the DTLS layer.
3) increase the buffer in TransportLayerDtls::PacketReceived() and in nICEr to something bigger than 8192. According to Wikipedia, it sounds like 9126 bytes would be a reasonable choice to support all the different jumbo frame implementations on the market today.

I'm not sure if it is "risky" in any way to bump the nICEr buffer size from 8192 to 9126. Depending on that I would pick option 3 or 2.

In any case the TestTransferMaxSize test for a direct socket connection can test with the maximum payload size, because it uses the loopback implementation to just pass the packets, with the additional DTLS overhead, from one side to the other. But for the TestTransferIceMaxSize test I'll always have to send slightly below the maximum buffer size, to avoid the additional bytes added by the DTLS layer getting truncated by the PR_RecvFrom() call on the receiving side. Roughly, the two tests then look like the sketch below.
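
(Hypothetical sketch; helper names are simplified from the unit test, and kMaxBufferSize stands for whatever receive buffer size we settle on:)

  TEST_F(TransportTest, TestTransferMaxSize) {
    SetDtlsPeer();
    ConnectSocket();
    // Loopback just hands the DTLS-expanded packet to the other side,
    // so the full buffer size can be exercised.
    TransferTest(1, kMaxBufferSize);
  }

  TEST_F(TransportTest, TestTransferIceMaxSize) {
    SetDtlsPeer();
    ConnectIce();
    // Leave headroom for the DTLS expansion so PR_RecvFrom() on the
    // receiving side does not truncate the packet.
    TransferTest(1, kMaxBufferSize - 256);
  }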
Attachment #8682932 - Attachment description: MozReview Request: Bug 1214269: read multiple DTLS packets from NSS if present. r+jesup → MozReview Request: Bug 1214269: read multiple DTLS packets from NSS if present. r?jesup,mt,ekr
Attachment #8682932 - Flags: review?(ekr)
Comment on attachment 8682932 [details]
MozReview Request: Bug 1214269: read multiple DTLS packets from NSS if present. r=mt,jesup

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/24173/diff/2-3/
https://reviewboard.mozilla.org/r/24173/#review22145

> I'm confused by this too.
> 
> The way this works is that DTLS-protection expands stuff, so DTLS-unprotection removes stuff, but by an unpredictable value.
> 
> It would be best to write packets of size X, then read buffers of X + 256 or so, both from ICE and from DTLS.

Commented on the details in the bug.

Increased the buffer in nICEr and transportlayerdtls in the latest patch to 9126 bytes to support all standard jumbo frames.
Flags: needinfo?(ekr)
Maybe I'm missing something, but I don't think that these numbers make much sense.

To recap, there is some maximum size of UDP payloads which can go over the wire, which
is dictated by some combination of IP/Ethernet, etc. Call that M. Now, when you
write a DTLS record, it has some expansion, E. Since the total packet has to
fit in M, that means that the maximum buffer you can pass to DTLS to write to
the network ~= M - E. For safety, it's easiest to estimate the maximum plausible
value of E at 256 or so and then write stuff smaller than M - max(E).

Conversely, when you *read* from the network, the biggest packet you can read
is M. So, you should pass a buffer of M to nr_socket_recvfrom(). And because
you don't know E with certainty, it's easiest to assume it's 0 and thus read
M bytes from NSS with PR_Recv(). But all of this is driven off of M, which is
an IP/Ethernet property.

The bottom line, then, is that 9126 doesn't seem like a sensible number.
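
Expressed as code, that recap amounts to something like this (values are illustrative assumptions, not from the patch):

  // M: maximum UDP payload the wire can carry (an IP/Ethernet property)
  static const size_t kMaxUdpPayload = 9216;
  // max(E): generous estimate of the DTLS per-record expansion
  static const size_t kMaxDtlsExpansion = 256;
  // Send side: never write more than M - max(E) of application data
  static const size_t kMaxWriteSize = kMaxUdpPayload - kMaxDtlsExpansion;
  // Receive side: assume E == 0 and read full M-sized buffers
  static const size_t kRecvBufferSize = kMaxUdpPayload;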
My assumption here is that we want to support packets for an MTU bigger than 1500 bytes. If not, we simply leave the buffer as is and let the DTLS layer discard all truncated packets for us.

Now if we want to support jumbo frames, I took the number 9126 just as the upper boundary for jumbo frames. If someone knows a better value for supporting jumbo frames, I'm happy to use that instead.

For the sending side I thought that we didn't have any size restriction in place. It turns out that when I try to send 9500 bytes, PR_SendTo() returns an error which seems to translate to "message too big". Which is really strange, because the transport_unittest connects over loopback and should therefore allow packets up to 16k in size. According to the sendto() man page the application is supposed to handle this error and retry with a smaller packet, although I have no clue how the application is supposed to figure out a safe size for the retry.
Should I file a follow-up bug for that so that it is at least documented?

Now for the buffer sizes in this patch: they only deal with the receiving side. So basically I'm trying to set M-0 here to 9126 bytes, according to the Wikipedia source. If someone knows a better value, please let me know.

If 9126 still does not make sense, what else should we use for the receiving buffer sizes?
So it turns out that the limit for sending UDP packets is a Mac-specific limitation. Mac OS X seems to have this set by default:
  net.inet.udp.maxdgram: 9216
That seems to be the reason that the kernel call to sendto() done by PR_SendTo() returns an error as soon as I go over 9216 bytes.
On Linux sending 9500 bytes works just fine, because at least my Ubuntu has no such limit set.

Just for future reference: in case we want to send more than 9216 bytes, it looks like that limit can be raised via a setsockopt() call if needed.
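
A sketch of that workaround with plain POSIX calls (whether raising SO_SNDBUF actually lifts the OS X datagram cap is an assumption to verify, not something we do today):

  #include <sys/socket.h>

  // Try to allow UDP datagrams larger than net.inet.udp.maxdgram by
  // enlarging the socket send buffer:
  int size = 65535;
  if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) != 0) {
    // fall back to keeping datagrams <= 9216 bytes
  }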

Transport and ICE unit test are passing under ASAN with the increased buffer sizes.
Comment on attachment 8682932 [details]
MozReview Request: Bug 1214269: read multiple DTLS packets from NSS if present. r=mt,jesup

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/24173/diff/3-4/
Attachment #8682932 - Attachment description: MozReview Request: Bug 1214269: read multiple DTLS packets from NSS if present. r?jesup,mt,ekr → MozReview Request: Bug 1214269: read multiple DTLS packets from NSS if present. r+ekr
Let's actually use 9216 bytes instead of 9126 :-)
See Also: → 797432
Attachment #8682932 - Attachment description: MozReview Request: Bug 1214269: read multiple DTLS packets from NSS if present. r+ekr → MozReview Request: Bug 1214269: read multiple DTLS packets from NSS if present. r=mt,jesup
Attachment #8682932 - Flags: review?(ekr)
Comment on attachment 8682932 [details]
MozReview Request: Bug 1214269: read multiple DTLS packets from NSS if present. r=mt,jesup

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/24173/diff/4-5/
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/8cd6b426846c
Status: ASSIGNED → RESOLVED
Closed: 9 years ago → 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Flags: needinfo?(rjesup)