Certain high-traffic uses of DataChannels/sctp fail when the packets get too large due to PMTU

RESOLVED FIXED in Firefox 41

Status

()

defect
P1
normal
Rank:
10
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: jesup, Assigned: jesup)

Tracking

42 Branch
mozilla43
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox40 wontfix, firefox41 fixed, firefox42 fixed, firefox43 fixed, firefox-esr31 wontfix, firefox-esr38 affected)

Details

Attachments

(1 attachment, 1 obsolete attachment)

Certain complex scenarios with high traffic and large numbers of channels (hundreds) appear to cause the MTU to grow (PMTU), and when it hit 1464 bytes (of SCTP payload, as decoded via SCTP:4 logging) the packets don't get through the DTLS connection.

After a number of retransmits, it decides the connection has failed and aborts the association, killing all datachannels.
backlog: --- → webrtc/webaudio+
Rank: 10
Priority: -- → P1
tested that basic functionality works; bemasc - can you retest with this patch?
Attachment #8648174 - Flags: review?(tuexen)
Assignee: nobody → rjesup
Status: NEW → ASSIGNED
Comment on attachment 8648174 [details] [diff] [review]
disable PMTU in DataChannels/SCTP

Review of attachment 8648174 [details] [diff] [review]:
-----------------------------------------------------------------

This patch doesn't set the PMTU, have a look at the patch I sent you a couple of minutes ago.
That disables the PMTUD and sets the PMTU to save value. If you don't want to set the PMTU,
you need to address the issues I marked.

::: netwerk/sctp/datachannel/DataChannel.cpp
@@ +454,5 @@
>      goto error_cleanup;
>    }
>  
> +  LOG(("Disabling SCTP PMTU"));
> +  memset(&paddrparams, 0, sizeof(paddrparams));

The next line disables Heartbeats. I suggest to call
usrsctp_getsockopt() first and then do
paddrparams.spp_flags &= !SPP_PMTUD_ENABLE;
paddrparams.spp_flags |= SPP_PMTUD_DISABLE;

@@ +455,5 @@
>    }
>  
> +  LOG(("Disabling SCTP PMTU"));
> +  memset(&paddrparams, 0, sizeof(paddrparams));
> +  paddrparams.spp_flags = SPP_PMTUD_DISABLE;

You need to set
paddrparams.spp_address.ss_family = AF_CONN;
You also want
#if defined(__Userspace_os_Darwin)
   paddrparams.spp_address.ss_len = sizeof(struct sockaddr_conn);
#endif

@@ +456,5 @@
>  
> +  LOG(("Disabling SCTP PMTU"));
> +  memset(&paddrparams, 0, sizeof(paddrparams));
> +  paddrparams.spp_flags = SPP_PMTUD_DISABLE;
> +  paddrparams.spp_assoc_id = SCTP_ALL_ASSOC;

Setting spp_assoc_id is not necessary.
Attachment #8648174 - Flags: review?(tuexen) → review-
Based on Tuexen's patch by email, with some fixes and making sure it runs if the connect succeeded without EINPROGRESS.  r+ from me for Tuexen's, r? to tuexen for my mods (though those don't really need it
Attachment #8648223 - Flags: review?(tuexen)
Attachment #8648223 - Flags: review?(tuexen) → review+
Just for clarification:
The problem is not related to high number of streams (data channels). It is just related to the fact that the SCTP association is up for more than 10 minutes. The initial path MTU is 1280 bytes, after 10 minutes path MTU discovery kicks in by a timeout and raises the path MTU to 1492 bytes. If now a user message is sent which allows messages of that size, they don't get through, since with the UDP and DTLS overhead it exceeds the MTU. Therefore the packet is dropped, even on retransmissions (SCTP relies currently on IP fragmentation). Finally the association is terminated. We are working on an improved version of PMTUD. Right now, it is best just to disable PMTUD and use a save value. This is what the patch does.
https://hg.mozilla.org/mozilla-central/rev/291adc935f33
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla43
Attachment #8648174 - Attachment is obsolete: true
Comment on attachment 8648223 [details] [diff] [review]
disable PMTUD in DataChannels/SCTP, set initial MTU

Approval Request Comment
[Feature/regressing bug #]: N/A

[User impact if declined]: DataChannels can fail after 10 minutes if the transport doesn't allow PMTUD to work properly.  This definitely affects uProxy, and *may* affect some TURN cases.  Simple use of TURN with datachannels in a Hello call without this patch (and width TURN forced) didn't seem to cause PMTUD to be invoked.  Other uses might, and there are scenarios (at least some unusual ones) where we believe it can fail.

[Describe test coverage new/current, TreeHerder]: Not easily testable in automation due to the network requirements and 10 minute timer.  uProxy has tested the fix successfully.  Should be manually verifiable with uProxy.

[Risks and why]: Very low risk - isolated change that disables an SCTP feature that's purely an efficiency-of-network feature, at least for discovery of larger-than-normal MTUs.  Limit set is the number from the DataChannels IETF draft.

[String/UUID change made/needed]: none.
Attachment #8648223 - Flags: approval-mozilla-beta?
Attachment #8648223 - Flags: approval-mozilla-aurora?
Comment on attachment 8648223 [details] [diff] [review]
disable PMTUD in DataChannels/SCTP, set initial MTU

Approved for uplift to Aurora and Beta. Given that the likelihood of running into this scenario is low, it seems safe to uplift to Beta directly.
Attachment #8648223 - Flags: approval-mozilla-beta?
Attachment #8648223 - Flags: approval-mozilla-beta+
Attachment #8648223 - Flags: approval-mozilla-aurora?
Attachment #8648223 - Flags: approval-mozilla-aurora+
Randell, Coverity complains about this patch:
CID 1320439 (#1 of 1): Logical vs. bitwise operator (CONSTANT_EXPRESSION_RESULT)logical_vs_bitwise: paddrparams.spp_flags &= 0U /* !8 */ always assigns 0 to paddrparams.spp_flags.
   Did you intend to use '~' rather than '!'?
https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/datachannel/DataChannel.cpp#616

Correct? Do you want a new bug report?
Flags: needinfo?(rjesup)
Ugh.. yes, that looks wrong; should be ~.
Flags: needinfo?(rjesup)
Depends on: 1198730
Yes, it needs to be ~
You need to log in before you can comment on or make changes to this bug.