Closed Bug 1562341 Opened 4 months ago Closed 3 months ago

Regression causing WebRTC data channel establishment to fail

Categories

(Core :: WebRTC, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- unaffected
firefox68 --- unaffected
firefox69 --- fixed
firefox70 --- fixed

People

(Reporter: mquander, Assigned: bwc)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 file)

In Nightly builds of Firefox more recent than 2019-06-20, data channels will fail to connect in circumstances where they would previously connect. An example that is easy to test is https://janus.conf.meetecho.com/textroomtest.html (if you click "Start" and the data channel is successfully established with Janus, the UI will prompt you for a display name.) In recent versions, the data channel will never transition from "connecting" to "open".

Examples where Nightly is negotiating with itself seem to work, e.g. https://mozilla.github.io/webrtc-landing/data_test.html.

Nightly 2019-06-20 is definitely working and 2019-06-21 is definitely broken. Strangely, mozregression seemed to indicate that the very first commit after the 2019-06-20 build (https://hg.mozilla.org/integration/autoland/rev/5392c325d0229333a4479fd4a993633403960cc1) was broken, but it's hard for me to believe. There were several other commits between 2019-06-20 and 2019-06-21 that appear potentially more relevant, related to https://bugzilla.mozilla.org/show_bug.cgi?id=1556795.

Component: General → WebRTC
Product: Firefox → Core
Keywords: regression

mozregression stopped at this pushlog for me https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=4cde299454c9851a32d555ce7004f2f69457ecb3&tochange=b8c319d5968f957d34e359829690614a9b69b95a

After that it failed to continue to find anything from autoland.

Since the pushlog only contains one bugfix for datachannels bug 1556795 appears to be the most likely root cause here.

Byron can you please have a look?

Assignee: nobody → docfaraday
Flags: needinfo?(docfaraday)
Priority: -- → P1
Regressions: 1556795

This is happening because Janus is rejecting our attempt to negotiate more datachannel streams. We might be gratuitously requesting this, however. Looking into it.

Flags: needinfo?(docfaraday)

Check back on try.

Flags: needinfo?(docfaraday)

Same here. As a curiosity, prior to v69 Firefox negotiates 256 SCTP outbound streams and 2048 inbound streams:

This is in the server side:
"SCTP association connected, streams [in:256, out:2048]"

However in v69:
"SCTP association connected, streams [in:16, out:2048]"

That build seems to me to resolve the problem.

This is happening because Janus is rejecting our attempt to negotiate more datachannel streams.

Byron, this is not what I see in my server which tries to negotiate 65535 inbound and outbound streams. During the SCTP association handshake I do not see any attempt from Firefox to increase those values (NOTE: I've not tested the new build referenced above).

Here the full SCTP packets during the SCTP handshake:

  1. Server to Firefox:

Frame 1: 100 bytes on wire (800 bits), 100 bytes captured (800 bits)
Stream Control Transmission Protocol, Src Port: 5000 (5000), Dst Port: 5000 (5000)
Source port: 5000
Destination port: 5000
Verification tag: 0x00000000
[Assocation index: 0]
Checksum: 0xac9a93d2 (not verified)
INIT chunk (Outbound streams: 65535, inbound streams: 65535)
Chunk type: INIT (1)
Chunk flags: 0x00
Chunk length: 86
Initiate tag: 0xb2652529
Advertised receiver window credit (a_rwnd): 131072
Number of outbound streams: 65535
Number of inbound streams: 65535
Initial TSN: 1757173820
Forward TSN supported parameter
Supported Extensions parameter (Supported types: FORWARD_TSN, AUTH, ASCONF, ASCONF_ACK, RE_CONFIG)
Random parameter
Requested HMAC Algorithm parameter (Supported HMACs: SHA-1)
Authenticated Chunk list parameter (Chunk types to be authenticated: ASCONF_ACK, ASCONF)
Chunk padding: 0000

  1. Firefox to Server:

Frame 1: 100 bytes on wire (800 bits), 100 bytes captured (800 bits)
Stream Control Transmission Protocol, Src Port: 5000 (5000), Dst Port: 5000 (5000)
Source port: 5000
Destination port: 5000
Verification tag: 0x00000000
[Assocation index: 0]
Checksum: 0x0af74e26 (not verified)
INIT chunk (Outbound streams: 16, inbound streams: 2048)
Chunk type: INIT (1)
Chunk flags: 0x00
Chunk length: 86
Initiate tag: 0x333ccc3c
Advertised receiver window credit (a_rwnd): 1048576
Number of outbound streams: 16
Number of inbound streams: 2048
Initial TSN: 2673829545
Forward TSN supported parameter
Supported Extensions parameter (Supported types: FORWARD_TSN, AUTH, ASCONF, ASCONF_ACK, RE_CONFIG)
Random parameter
Requested HMAC Algorithm parameter (Supported HMACs: SHA-1)
Authenticated Chunk list parameter (Chunk types to be authenticated: ASCONF_ACK, ASCONF)
Chunk padding: 0000

  1. Server to Firefox:

Frame 1: 412 bytes on wire (3296 bits), 412 bytes captured (3296 bits)
Stream Control Transmission Protocol, Src Port: 5000 (5000), Dst Port: 5000 (5000)
Source port: 5000
Destination port: 5000
Verification tag: 0x333ccc3c
[Assocation index: 0]
Checksum: 0x2cd77cd4 (not verified)
INIT_ACK chunk (Outbound streams: 2048, inbound streams: 65535)
Chunk type: INIT_ACK (2)
Chunk flags: 0x00
Chunk length: 400
Initiate tag: 0xb2652529
Advertised receiver window credit (a_rwnd): 131072
Number of outbound streams: 2048
Number of inbound streams: 65535
Initial TSN: 1757173820
Forward TSN supported parameter
Supported Extensions parameter (Supported types: FORWARD_TSN, AUTH, ASCONF, ASCONF_ACK, RE_CONFIG)
Random parameter
Requested HMAC Algorithm parameter (Supported HMACs: SHA-1)
Authenticated Chunk list parameter (Chunk types to be authenticated: ASCONF_ACK, ASCONF)
State cookie parameter (Cookie length: 308 bytes)

  1. Firefox to Server:

Frame 1: 412 bytes on wire (3296 bits), 412 bytes captured (3296 bits)
Stream Control Transmission Protocol, Src Port: 5000 (5000), Dst Port: 5000 (5000)
Source port: 5000
Destination port: 5000
Verification tag: 0xb2652529
[Assocation index: 0]
Checksum: 0x04eb1eaf (not verified)
INIT_ACK chunk (Outbound streams: 16, inbound streams: 2048)
Chunk type: INIT_ACK (2)
Chunk flags: 0x00
Chunk length: 400
Initiate tag: 0x333ccc3c
Advertised receiver window credit (a_rwnd): 1048576
Number of outbound streams: 16
Number of inbound streams: 2048
Initial TSN: 2673829545
Forward TSN supported parameter
Supported Extensions parameter (Supported types: FORWARD_TSN, AUTH, ASCONF, ASCONF_ACK, RE_CONFIG)
Random parameter
Requested HMAC Algorithm parameter (Supported HMACs: SHA-1)
Authenticated Chunk list parameter (Chunk types to be authenticated: ASCONF_ACK, ASCONF)
State cookie parameter (Cookie length: 308 bytes)

  1. Server to Firefox:

Frame 1: 324 bytes on wire (2592 bits), 324 bytes captured (2592 bits)
Stream Control Transmission Protocol, Src Port: 5000 (5000), Dst Port: 5000 (5000)
Source port: 5000
Destination port: 5000
Verification tag: 0x333ccc3c
[Assocation index: 0]
Checksum: 0x831c4c95 (not verified)
COOKIE_ECHO chunk (Cookie length: 308 bytes)
Chunk type: COOKIE_ECHO (10)
Chunk flags: 0x00
Chunk length: 312
Cookie: 4b414d452d42534420312e31000000008a881a5d00000000...

  1. Firefox to Server:

Frame 1: 324 bytes on wire (2592 bits), 324 bytes captured (2592 bits)
Stream Control Transmission Protocol, Src Port: 5000 (5000), Dst Port: 5000 (5000)
Source port: 5000
Destination port: 5000
Verification tag: 0xb2652529
[Assocation index: 0]
Checksum: 0x1be2cb60 (not verified)
COOKIE_ECHO chunk (Cookie length: 308 bytes)
Chunk type: COOKIE_ECHO (10)
Chunk flags: 0x00
Chunk length: 312
Cookie: 4b414d452d42534420312e31000000008a881a5d00000000...

I can still reproduce this with my testing tool. Just check Offering, click on Start and look at the browser log.

[Child 9446: Socket Thread]: D/DataChannel Deleting DataChannelConnection 0x7f22e126a400
[Child 9446: Main Thread]: D/DataChannel Constructor DataChannelConnection=0x7f22dd4ccc00, listener=0x7f22dd4cb408
[Child 9446: Main Thread]: D/DataChannel Maximum message size (outgoing data): 2147483637 (set=no, enforced=yes)
[Child 9446: Main Thread]: D/DataChannel Setting number of SCTP streams to 256, was 10/2048
[Child 9446: Main Thread]: D/DataChannel Registered 0x7f22dd4ccc00 within the SCTP stack.
[Child 9446: Main Thread]: D/DataChannel DC Open: label cat-noises/, type 0, inorder 1, prValue 0, listener (nil), context (nil), external: true, stream 0
[Child 9446: Main Thread]: D/DataChannel Inserting channel 0 : 0x7f22e12ba670
[Child 9446: Main Thread]: D/DataChannel Setting number of SCTP streams to 16, was 256/2048
[Child 9446: Main Thread]: D/DataChannel Queuing channel 0x7f22e12ba670 (0) to finish open
[Child 9446: Main Thread]: D/DataChannel Init: origin = null

This is before ICE even kicked in. For whatever reason, the amount of SCTP streams to be negotiated is overriden by 16. IIRC this happens in the DataChannelConnection::OpenFinish function. It might be that this is related to an out-of-band negotiated data channel being created early. In that case, I would recommend adding a test case since oob negotiated channels have been a source of bugs in this implementation pretty often.

May I add another note that rtcweb-data-channel-13 says that the maximum amount of streams SHOULD be announced. The only reason we don't is early memory allocation. Thus, don't expect stream renegotiation support. I would say the majority of implementations don't have implemented support for it and there's actually no reason to do it once memory is allocated lazily. In addition: Please, if you have the resources, support the usrsctp team with a contribution by for example resolving this issue.

Lennart, indeed we create the DataChannel if Firefox before event the ICE+DTLS is connected. Do you mean that this may cause the issue above?

No, I'm now waiting for PeerConnection "iceconnectionstatechange" event to be "connected" (and also waiting for some seconds after it) before I create the DataChannel in Firefox 69.0a1 (2019-07-02), and it produces the same traces as above (streams [in:16, out:2048]) and its readyState gets stuck in "connecting", making impossible to use write() (however, funny thing is that it can receive messages over that DataChannel in "connected" readyState...).

BTW: does it make sense that, regardless the DTLS roles, each endpoint uses usrsctp_connect() so both endpoints send a SCTP INIT and both reply SCTP INIT+ACK? or should just be a single SCTP INIT in a direction determined by DTLS roles? (sorry if this is off-topic, but I see that most implementations use usrsctp_connect() anyway so we get INIT and INIT+ACK in both directions and wonder if it's related to this issue).

BTW: does it make sense that, regardless the DTLS roles, each endpoint uses usrsctp_connect() so both endpoints send a SCTP INIT and both reply SCTP INIT+ACK? or should just be a single SCTP INIT in a direction determined by DTLS roles? (sorry if this is off-topic, but I see that most implementations use usrsctp_connect() anyway so we get INIT and INIT+ACK in both directions and wonder if it's related to this issue).

This is fine and a not well-known feature of SCTP called simultaneous open, making the code quite a lot simpler. Definitely not related to this issue.

(In reply to Lennart Grahl from comment #9)

I can still reproduce this with my testing tool. Just check Offering, click on Start and look at the browser log.

[Child 9446: Socket Thread]: D/DataChannel Deleting DataChannelConnection 0x7f22e126a400
[Child 9446: Main Thread]: D/DataChannel Constructor DataChannelConnection=0x7f22dd4ccc00, listener=0x7f22dd4cb408
[Child 9446: Main Thread]: D/DataChannel Maximum message size (outgoing data): 2147483637 (set=no, enforced=yes)
[Child 9446: Main Thread]: D/DataChannel Setting number of SCTP streams to 256, was 10/2048
[Child 9446: Main Thread]: D/DataChannel Registered 0x7f22dd4ccc00 within the SCTP stack.
[Child 9446: Main Thread]: D/DataChannel DC Open: label cat-noises/, type 0, inorder 1, prValue 0, listener (nil), context (nil), external: true, stream 0
[Child 9446: Main Thread]: D/DataChannel Inserting channel 0 : 0x7f22e12ba670
[Child 9446: Main Thread]: D/DataChannel Setting number of SCTP streams to 16, was 256/2048
[Child 9446: Main Thread]: D/DataChannel Queuing channel 0x7f22e12ba670 (0) to finish open
[Child 9446: Main Thread]: D/DataChannel Init: origin = null

This is before ICE even kicked in. For whatever reason, the amount of SCTP streams to be negotiated is overriden by 16. IIRC this happens in the DataChannelConnection::OpenFinish function. It might be that this is related to an out-of-band negotiated data channel being created early. In that case, I would recommend adding a test case since oob negotiated channels have been a source of bugs in this implementation pretty often.

I can see if I can bring that number back up to 256.

Try looks good.

Flags: needinfo?(docfaraday)

To clarify: I did try it with the patch and the above posted log output was the result.

(In reply to Byron Campen [:bwc] from comment #13)

I can see if I can bring that number back up to 256.

May I know what the purpose of just announcing 256 outbound SCTP streams is? Given that DataChannels are pair of bidirectional SCTP streams with same id, and given that implementations do not typically support adding new streams (but just resetting them on demand), announcing out:256, in:2048 means that Firefox will be limited to just 256 DataChannels.

Also, if Firefox announces up to 2048 streams (just true for inbound streams anyway), why does it allow creating negotiated DataChannels with id up to 65534 (the maximum value as per spec)?

dc = peerconnection.createDataChannel('foo', { negotiated:true, id:65534 });

// => RTCDataChannel {
//   label: "foo", 
//   negotiated: true, 
//   ordered: true, 
//   reliable: true, 
//   maxPacketLifeTime: null, 
//   maxRetransmits: null, 
//   protocol: "", 
//   id: 65534, 
//   readyState: "connecting", 
//   bufferedAmount: 0
// }

(In reply to Iñaki Baz Castillo from comment #16)

(In reply to Byron Campen [:bwc] from comment #13)

I can see if I can bring that number back up to 256.

May I know what the purpose of just announcing 256 outbound SCTP streams is? Given that DataChannels are pair of bidirectional SCTP streams with same id, and given that implementations do not typically support adding new streams (but just resetting them on demand), announcing out:256, in:2048 means that Firefox will be limited to just 256 DataChannels.

Until recently (the bug that caused this regression), we pre-allocated an array big enough to hold pointers to the full compliment of streams. There may also be some memory overhead in libusrsctp; Lennart, do you know if this is the case?

Also, if Firefox announces up to 2048 streams (just true for inbound streams anyway), why does it allow creating negotiated DataChannels with id up to 65534 (the maximum value as per spec)?

dc = peerconnection.createDataChannel('foo', { negotiated:true, id:65534 });

// => RTCDataChannel {
//   label: "foo", 
//   negotiated: true, 
//   ordered: true, 
//   reliable: true, 
//   maxPacketLifeTime: null, 
//   maxRetransmits: null, 
//   protocol: "", 
//   id: 65534, 
//   readyState: "connecting", 
//   bufferedAmount: 0
// }

Good question. I am not sure why we have a max of 2048. Jesup?

Flags: needinfo?(rjesup)
Flags: needinfo?(lennart.grahl)

The problem is the amount of allocated memory (see also my earlier comment above). To quote Taylor Brandstetter in the corresponding Chromium issue:

We currently only negotiate 1024. I believe the main barrier is memory usage; usrsctp allocates 104 bytes (on a 64-bit system) for each stream, regardless of whether or not it's used, so negotiating the maximum number of streams would waste ~6MB.

This would be 6 MiB for each RTCPeerConnection. On top, you have to add the amount needed by the data channel library itself (if it does not allocate lazily). You can imagine what would happen when you fire up something like webtorrent.. on a mobile device. :)

Hence my request to get it fixed in usrsctp.

Flags: needinfo?(lennart.grahl)

And, yes, Firefox should fail when calling createDataChannel with an ID that is greater or equal to sctp.maxChannels, as described in step 20.

(In reply to Lennart Grahl from comment #19)

And, yes, Firefox should fail when calling createDataChannel with an ID that is greater or equal to sctp.maxChannels, as described in step 20.

We don't support RTCSctpTransport right now.

maxChannels is just

the minimum of the negotiated amount of incoming and outgoing SCTP streams.

And we have that information. :)

Attachment #9075201 - Attachment description: Bug 1562341: Set mNegotiatedIdLimit when the channel opens. r?mjf → Bug 1562341: Set mNegotiatedIdLimit when the channel opens, and remove some code that was stomping our initially chosen max streams. r?mjf

I understand the memory issue. However, which is the status of this issue then? Some questions:

  1. Assuming that the only difference between FF 67 (it works) and FF 69 (readyState is "connecting" forever) is that in 67 FF indicates out:256,in:2048, and in 69 FF indicates out:12,in:2048, and assuming that the trace in #c8 is similar in both cases with just that difference, why does it work in 67 (readyState becomes "open")?

  2. Is there any issue to fix in usrsctp regarding this? May be usrsctp does not honor the out:16 of FF 69 and places in:65535 in the INIT_ACK and FF does not like it?

  3. Is there any issue to fix in FF itself?

  4. How is Firefox supposed to work if more that 16 DCs are desired by the app? Will it use SCTP_ADD_STREAM request so the remote peer must allow it? (currently we reject such a request).

  5. Which exact behavior does https://phabricator.services.mozilla.com/D36517 change?

(In reply to Lennart Grahl from comment #21)

maxChannels is just

the minimum of the negotiated amount of incoming and outgoing SCTP streams.

And we have that information. :)

What about pre-negotiation?

(In reply to Iñaki Baz Castillo from comment #22)

I understand the memory issue. However, which is the status of this issue then? Some questions:

  1. Assuming that the only difference between FF 67 (it works) and FF 69 (readyState is "connecting" forever) is that in 67 FF indicates out:256,in:2048, and in 69 FF indicates out:12,in:2048, and assuming that the trace in #c8 is similar in both cases with just that difference, why does it work in 67 (readyState becomes "open")?

  2. Is there any issue to fix in usrsctp regarding this? May be usrsctp does not honor the out:16 of FF 69 and places in:65535 in the INIT_ACK and FF does not like it?

  3. Is there any issue to fix in FF itself?

  4. How is Firefox supposed to work if more that 16 DCs are desired by the app? Will it use SCTP_ADD_STREAM request so the remote peer must allow it? (currently we reject such a request).

This is what we attempt to do, yes.

  1. Which exact behavior does https://phabricator.services.mozilla.com/D36517 change?
  1. We avoid using SCTP_ADD_STREAM until the initial open is complete, and
  2. We remember what was negotiated during open so we don't SCTP_ADD_STREAM gratuitously (which is why we were hanging in connected)

We knew at the start that usrsctp had the memory allocation issue - and originally it seemed wasteful (especially on low-end mobile 6 years ago!) to preallocate even 1024 sctp streams, let alone 65534 per use of DataChannels, at least until that was resolved. And it was expected (and likely is true today) that most users of Datachannels use <10 (and I'll be 1 is the most common by far). Rarely would it hit 256 - mostly things like uProxy. We bumped from 16 to 256 when we found that Chrome hadn't bothered supporting ADD_STREAM (we supported it from day one), IIRC.

IIRC I fought the 65534 issue in the spec committee, and Google's winning argument was that this was just a detail that would be fixed upstream and so we could simplify things for people by not negotiating the number of channels. Which is reasonable, if there actually were plans to (help) fix upstream - which I don't think they've attempted. Similarly the whole nData issue and fragmentation/transfer sizes. Instead they put in a fixed limit and forgot about it, I believe.

Flags: needinfo?(rjesup)

And it was expected (and likely is true today) that most users of Datachannels use <10 (and I'll be 1 is the most common by far). Rarely would it hit 256

Well, in a SFU this is different. We are implementing DataChannel in mediasoup (mostly done) and we use a separate DataChannel for receiving data from each different participant, so we may really need much more than 16 channels (at least for receiving!).

Oh! I have it working in my server with this change!

  // Set SCTP_ENABLE_STREAM_RESET.
  struct sctp_assoc_value av;

-  av.assoc_value = SCTP_ENABLE_RESET_STREAM_REQ;
+  av.assoc_value = SCTP_ENABLE_RESET_STREAM_REQ | SCTP_ENABLE_RESET_ASSOC_REQ | SCTP_ENABLE_CHANGE_ASSOC_REQ;
 
  ret = usrsctp_setsockopt(this->socket, IPPROTO_SCTP, SCTP_ENABLE_STREAM_RESET, &av, sizeof(av));

In fact I see this in the logs:

mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP association connected, streams [in:16, out:2048]
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:32, out:2048, flags:0]

Now I see that datachannel.readyState becomes "open" :)

So basically SCTP_ENABLE_CHANGE_ASSOC_REQ is required as per https://tools.ietf.org/html/rfc6525#section-6.3:

SCTP_ENABLE_CHANGE_ASSOC_REQ: Process received Add Outgoing
Streams Requests if this flag is set; deny them if not.

Since this issue was originally reported for Janus, the problem is that Janus does this:

av.assoc_value = 1;

so it's just enabling SCTP_ENABLE_RESET_STREAM_REQ but it's not enabling SCTP_ENABLE_CHANGE_ASSOC_REQ.

With the patch in #c28 applied in my server I see the following behavior in FF 69:

  • I can create up to 32 DataChannels, no matter they are in the streamId range 0-31 or 100-131 or whatever. This is strange for me since I assumed that the outbound SCTP streams that a SCTP endpoint can create are in the range 0-OS, and in FF 69 OS is 32 after the SCTP_ENABLE_CHANGE_ASSOC_REQ:
OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:32, out:2048, flags:0]
  • However, if I try to have more that 32 opened DataChannels, then Firefox just fails with a very cool error:
failed:DOMException: "The operation failed for an operation-specific reason" 

In #c24 above, Byron replied this:

How is Firefox supposed to work if more that 16 DCs are desired by the app? Will it use SCTP_ADD_STREAM request so the remote peer must allow it? (currently we reject such a request).

This is what we attempt to do, yes.

I just want to confirm that, when I've tried to create the 33th DataChannel in FF, I've not seen any SCTP log in my server related to SCTP_ENABLE_CHANGE_ASSOC_REQ, so I'd say that Firefox is NOT trying to enable more simultaneous SCTP channels.

I'm honestly fine with "just" having 32 outbound SCTP streams in Firefox. However, I do need much more than 32 inbound SCTP streams and, regardless FF negotiates "out:23, in:2048", DataChannels are two bidirectional SCTP streams with same id, so FF is limited to just have 32 opened DataChannels in total. Is this true? (please, say me that it's not).

(In reply to Iñaki Baz Castillo from comment #29)

With the patch in #c28 applied in my server I see the following behavior in FF 69:

  • I can create up to 32 DataChannels, no matter they are in the streamId range 0-31 or 100-131 or whatever. This is strange for me since I assumed that the outbound SCTP streams that a SCTP endpoint can create are in the range 0-OS, and in FF 69 OS is 32 after the SCTP_ENABLE_CHANGE_ASSOC_REQ:
OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:32, out:2048, flags:0]
  • However, if I try to have more that 32 opened DataChannels, then Firefox just fails with a very cool error:
failed:DOMException: "The operation failed for an operation-specific reason" 

What API call fails with this error? Right now, the webrtc code throws OperationError only when certain things go wrong in JSEP (except for one case related to peerIdentity). Does this happen when testing against stuff other than Janus? Can you reproduce this in jsfiddle?

Flags: needinfo?(ibc)

Try looks good.

Ok, something to investigate. Indeed I've created a jsfiddle (open the browser console to see the logs please):

https://jsfiddle.net/ibcaliax/u3rx78qb/

and indeed FF 69 can create up to 2048 DataChannels (negotiated: true with incremental id).

Must investigate why it fails after 32 Datachannels when connected to my server (it's not Janus BTW).

Flags: needinfo?(ibc)

(In reply to Iñaki Baz Castillo from comment #32)

Ok, something to investigate. Indeed I've created a jsfiddle (open the browser console to see the logs please):

https://jsfiddle.net/ibcaliax/u3rx78qb/

and indeed FF 69 can create up to 2048 DataChannels (negotiated: true with incremental id).

Must investigate why it fails after 32 Datachannels when connected to my server (it's not Janus BTW).

Ok. Definitely let me know if you find Firefox doing something broken here.

Sorry, no idea what happened yesterday here (too many active things) but indeed I can also create NNNN DataChannels having FF 69 connected to my server.

As a curiosity, NNNN is 1024 (if fails after it with "NS_ERROR_UNEXPECTED") instead of 2048. No idea why yet but I assume it's unrelated.

ok, the 1024 v2 2048 is a stupid bug in my client code, which limits itself to 1024 due to an error.

Everything looks fine then.

Cool. Glad the renegotiation works; we still should bump the default back to what it was at least. And then deal with either upstream fixes for mem use, and/or making this work well with Chrome and their code (which last I checked didn't support SCTP_ENABLE_CHANGE_ASSOC_REQ, and limited to 2048)

(In reply to Byron Campen [:bwc] from comment #23)

(In reply to Lennart Grahl from comment #21)

maxChannels is just

the minimum of the negotiated amount of incoming and outgoing SCTP streams.

And we have that information. :)

What about pre-negotiation?

2048 (MAX_NUM_STREAMS) is the upper limit of Firefox atm and that is known at any given time. Sure, once connected, it must also close all channels that are beyond the actually negotiated amount of streams.

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #36)

Cool. Glad the renegotiation works; we still should bump the default back to what it was at least. And then deal with either upstream fixes for mem use, and/or making this work well with Chrome and their code (which last I checked didn't support SCTP_ENABLE_CHANGE_ASSOC_REQ, and limited to 2048)

Yup, please bump back to 256. And then preferably an upstream fix to the allocation problem regarding the amount of streams... it's extremely unlikely Google will do it (I'll spare you the rant on Google, Randell already broached the subject) and it's simply not high priority for the usrsctp team. I know that it will be annoying to deal with since it must also work in the kernel but perhaps Michael accepts a patch that only adds the lazy allocation behaviour for the userspace.

Well, let me come back into this.

FF 69 does not allow creating DataChannels with id > 31. Well, it allows it since I do it with negotiated: true and FF does not complain (although it should not allow it). But then, when calling dc.send(DATA) it fails with:

OperationError: The operation failed for an operation-specific reason

So, in summary, it seems that FF 69 initially negotiates 16 outbound SCTP streams:

mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP association connected, streams [in:16, out:2048]

(note: those are the logs in the server so "in" means FF outgoing streams and "out" means FF incoming streams).

Then FF renegotiates the association by incrementing the number of SCTP outgoing streams to 32;

mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:32, out:2048, flags:0]

So that's all. Somehow FF 69 allows 0-31 outgoing SCTP streams and 0-2047 incoming SCTP streams. Since DataChannels are bidirectional SCTP streams, this means that FF 69 limits itself to 32 DataChannels (with id in 0..31). Period.

So no, FF does not try to "add streams" once those 32 outgoing SCTP streams are used, not at all.

This is very sad. Chrome limits the SCTP message size and so on, but at least it allows the application create up to 1024 DataChannels (from id 0 to 1023). Currently in FF 69 just 32 DataChannels can be simultaneously opened. Not enough at all for a SFU scenario.

Wow, so this depends on the number of SCTP streams announced by the remote peer (in this case my SFU). Those are server side usrsctp event logs:

1. The SFU announces 65535 in/out SCTP streams:

RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP association connected, streams [in:16, out:2048]
RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:32, out:2048, flags:0]

(so just 32 DataChannels in FF !!! WHAT?)

2. The SFU announces 2048 in/out SCTP streams:

RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP association connected, streams [in:16, out:2048]
RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:32, out:2048, flags:0]

(so just 32 DataChannels WHAT??)

3. The SFU announces 1024 in/out SCTP streams:

RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP association connected, streams [in:16, out:1024]
RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:32, out:1024, flags:0]
RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:1024, out:1024, flags:0]

(so 1024 DataChannels in FF, it honors the server value!)

4. The SFU announces 256 in/out SCTP streams:

RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP association connected, streams [in:16, out:256]
RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:32, out:256, flags:0]
RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:256, out:256, flags:0]

(so 256 DataChannels, it honors the server value)

5. The SFU announces 128 in/out SCTP streams:

RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP association connected, streams [in:16, out:128]
RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:32, out:128, flags:0]
RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:128, out:128, flags:0]

(so 128 DataChannels in FF, it honors the server value)

Please, take a look to these logs. Basically, if the remote peer announces more than 1024 SCTP in/out SCTP streams then Firefox just does a "SCTP add streams" and "upgrades" from 16 to 32.

If the remote peer announces less or equal than 1024 SCTP in/out SCTP streams then Firefox honors it and performs two "SCTP add streams" CONFIG_RE and "upgrades" from 16 to 32 and then from 32 to N.

This is crazy.

And super funny:

6. The SFU announces 2047 in/out SCTP streams:

RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP association connected, streams [in:16, out:2047] 
RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:32, out:2047, flags:0]
RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:33, out:2047, flags:0]

(so 33 DataChannels in FF !!!! How is this??)

In order to show how our SFU does the SCTP handshake, this is the exact code:

https://github.com/versatica/mediasoup/blob/datachannel/worker/src/RTC/SctpAssociation.cpp?ts=2#L108

(just replace numSctpStreams with the values given above in each section).

What is the rationale of FF on advertising different in/out stream numbers at all? That looks like a nonsense given that a DataChannel is bidirectional.

Any reason for that?

Any reason for that?

Self response: in order to be able to do lazy allocation.

Works fine for me now with the second patch. RAWRTC sayeth:

sctp-transport: Association change: SCTP_COMM_UP, streams (in/out) = (256/2048), supports PR MULTIBUF RE-CONFIG

Iñaki, have you tested that with the most recent changeset?

Iñaki, have you tested that with the most recent changeset?

No idea. Is it supposed to be already in latest Nightly?

(NOTE: No idea if I have this patch applied in my Nightly or not)

Regarding my logs in comment 39 above, we have verified that, depending on the initmsg.sinit_max_instreams value of the remote peer (my SFU in this case) Firefox behaves completely different.

For example, if the SFU announces sinit_num_ostreams = 2048 and sinit_max_instreams = 2048, then Firefox 69 renegotiates from 16 to 32 outgoing SCTP streams once (upon the first DataChannel is created) and then it does NOT renegotiate it anymore, so it FAILS when creating a DataChannel with id > 31.

However, if the SFU announces sinit_num_ostreams = 16 and sinit_max_instreams = 2048 (this is, the same than FF 69 announces), then Firefox 69 renegotiates from 16 to 32 outgoing SCTP streams (upon the first DataChannel is created), and then increases by 16 over and over while new DataChannels are created and there are less than 16 outgoing ids left. So it can create up to 2048 DataChannels in "open" state.

Obviously something is wrong here. Is this patch supposed to fix it?

NOTE: Please remember that our SFU does support RE_CONFIG so it allows SCTP_ADD_STREAMS from the remote.

It's not merged, so you can't have it if you haven't applied the patch manually or downloaded the binaries from the try build. For me, the patch fixes the issue.

Can you please tell me how to download the binary from the "try build"? I see this:

https://ftp.mozilla.org/pub/firefox/try-builds/

but no idea...

FYI, Byron, you can also ping me for reviews/question/whatever regarding data channel related stuff in the future (also IRC). :)

Status given our experiments:

  • FF 69 properly renegotiates its OS and handles incoming SCTP_ADD_STREAMS from the remote, being able to create up to 2048 negotiated:true DataChannels (tested).

  • However FF 69 just behaves correctly if the remote announces OS 16, 256 or 1024. Otherwise it does not properly renegotiates its OS. Is this what the patch is supposed to fix?

    • As a curiosity, IHMO nobody here has yet explained what such a patch is supposed to fix.
    • There are some "it works for me" above which clarify nothing because I do not what their remote SCTP endpoints announce in the INIT.
    • Does the patch make FF behave as FF 67 in which the browser starts with OS 256 and adds 16 more outgoing streams once those are close to be exhausted?

(In reply to Iñaki Baz Castillo from comment #38)

Well, let me come back into this.

FF 69 does not allow creating DataChannels with id > 31. Well, it allows it since I do it with negotiated: true and FF does not complain (although it should not allow it). But then, when calling dc.send(DATA) it fails with:

OperationError: The operation failed for an operation-specific reason

So, in summary, it seems that FF 69 initially negotiates 16 outbound SCTP streams:

mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP association connected, streams [in:16, out:2048]

(note: those are the logs in the server so "in" means FF outgoing streams and "out" means FF incoming streams).

Then FF renegotiates the association by incrementing the number of SCTP outgoing streams to 32;

mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:32, out:2048, flags:0]

So that's all. Somehow FF 69 allows 0-31 outgoing SCTP streams and 0-2047 incoming SCTP streams. Since DataChannels are bidirectional SCTP streams, this means that FF 69 limits itself to 32 DataChannels (with id in 0..31). Period.

So no, FF does not try to "add streams" once those 32 outgoing SCTP streams are used, not at all.

We should be renegotiating that limit every time it is exceeded. Can you provide a jsfiddle that demonstrates this problem on the binary from try?

(In reply to Iñaki Baz Castillo from comment #46)

Can you please tell me how to download the binary from the "try build"? I see this:

https://ftp.mozilla.org/pub/firefox/try-builds/

but no idea...

The most recent try is here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=089ea4298281ef41890c32bfde7e4094bdfeddee

I'm not sure which platform you're on, but here are the big ones:

Win32: https://queue.taskcluster.net/v1/task/bDa8V-JhQxSC2L-4SLYlEA/runs/0/artifacts/public/build/install/sea/target.installer.exe
Win64: https://queue.taskcluster.net/v1/task/M7FDITuiQ2WW6LE1-SzeVg/runs/0/artifacts/public/build/install/sea/target.installer.exe
OSX: https://queue.taskcluster.net/v1/task/UE60-kOMQEim0GlwvSF-ug/runs/0/artifacts/public/build/target.dmg
Linux32: https://queue.taskcluster.net/v1/task/XznVD-vQQnGToNJjM7v8lw/runs/0/artifacts/public/build/target.tar.bz2
Linux64: https://queue.taskcluster.net/v1/task/EDZAMpgZTU6wbDIHGAkxlA/runs/0/artifacts/public/build/target.tar.bz2

(In reply to Iñaki Baz Castillo from comment #48)

Status given our experiments:

  • FF 69 properly renegotiates its OS and handles incoming SCTP_ADD_STREAMS from the remote, being able to create up to 2048 negotiated:true DataChannels (tested).

  • However FF 69 just behaves correctly if the remote announces OS 16, 256 or 1024. Otherwise it does not properly renegotiates its OS. Is this what the patch is supposed to fix?

    • As a curiosity, IHMO nobody here has yet explained what such a patch is supposed to fix.

Comment 24 should cover the changes in the patch. The reason those changes bring the initial limit back up to 256 is because the "Increment by 16" was interfering/racing with the "initially negotiate 256".

  • There are some "it works for me" above which clarify nothing because I do not what their remote SCTP endpoints announce in the INIT.\

Yeah, to determine whether it works for you, you'll need to try out one of the binaries above.

  • Does the patch make FF behave as FF 67 in which the browser starts with OS 256 and adds 16 more outgoing streams once those are close to be exhausted?

That's the intent, and what others in this bug are seeing. Hopefully it will for you also, but we shall see!

We should be renegotiating that limit every time it is exceeded. Can you provide a jsfiddle that demonstrates this problem on the binary from try?

I cannot provide any jsfiddle because as I explained every depends on how the remote endpoint negotiates its OS and MIS in its SCPT INIT.

BTW: yes, FF 69 sends SCTP_ADD_STREAMS every 16 new DataChannels. When the peer is FF 69, our SFU initiates its SCTP INIT with OS:16 and MIS:2048. And we have added code to react on the SCTP_STREAM_CHANGE_EVENT (caused by FF sending a SCTP_ADD_STREAMS(16) to our SFU). In that code our SFU reacts by also sending a SCTP_ADD_STREAMS(N) to FF 69.

In FF 67 this is different. If our SFU sends SCTP INIT with OS:16 and MIS:2048, Firefox 67 does not later send proper SCTP_ADD_STREAMS and cannot handle 2048 opened DataChannels.

So, depending on the OS in the SCTP INIT of the remote endpoint (our SFU in this case) different versions of Firefox behave different and wrongly in same cases.

Said that, you can test our environment with Chrome, FF 67 and FF 69 as follows:

  • Enter in https://test.mediasoup.org/?roomId=foo with two browsers.
  • In one of them, open the console and type __testSctp().
  • Lot of DataChannels will be created in both browsers and messages with the outgoing SCTP stream will be sent from that browser to the SFU and from the SFU to the other browser.

As a very funny fact, if you do that test with two tabs of FF 69, then the receiver one will NOT send SCTP_ADD_STREAMS so won't be able to create and connect more than 32 DataChannels. That does not happens if there is a single FF tab doing the test in any direction, so it seems that Firefox has some kind of global limit.

Does the patch make FF behave as FF 67 in which the browser starts with OS 256 and adds 16 more outgoing streams once those are close to be exhausted?
That's the intent, and what others in this bug are seeing. Hopefully it will for you also, but we shall see!

Thanks, Byron. Ok, this is what I see by testing Firefox NightlyDebug with the patch:

# mediasoup (our SFU) initiates its SCTP with OS:256 and MIS:2048 when the remote is Firefox NightlyDebug 69 (with the patch applied).
#
# For each different streamId below, both Chrome (sender) and FF 69 (receiver) create a negotiated:true DataChannel.
# Every received SCTP message below has been sent by Chrome. The SFU sends them to FF 69 also using incremental streamIds.


# This is the log of the SCTP initialization with FF 69:

mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP association connected, streams [in:256, out:256]
mediasoup-demo-server:INFO:Room WebRtcTransport "sctpstatechange" event [sctpState:connected] +285ms


# Chrome starts creating negotiated DataChannels and sending messages with incremental streamId (one for each DC).
# Firefox creates a negotiated DataChannel for each of them:

mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:15, streamId:2, SSN:0, TSN:465916674, PPID:51, context:0, flags:8] +8s
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:15, streamId:3, SSN:0, TSN:465916675, PPID:51, context:0, flags:8] +143ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:15, streamId:4, SSN:0, TSN:465916676, PPID:51, context:0, flags:8] +239ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:15, streamId:5, SSN:0, TSN:465916677, PPID:51, context:0, flags:8] +117ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:15, streamId:6, SSN:0, TSN:465916678, PPID:51, context:0, flags:8] +112ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:15, streamId:7, SSN:0, TSN:465916679, PPID:51, context:0, flags:8] +150ms

# etc etc, more messages for streamId++...

mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:250, SSN:0, TSN:465916922, PPID:51, context:0, flags:8] +112ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:251, SSN:0, TSN:465916923, PPID:51, context:0, flags:8] +121ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:252, SSN:0, TSN:465916924, PPID:51, context:0, flags:8] +134ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:253, SSN:0, TSN:465916925, PPID:51, context:0, flags:8] +111ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:254, SSN:0, TSN:465916926, PPID:51, context:0, flags:8] +114ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:255, SSN:0, TSN:465916927, PPID:51, context:0, flags:8] +117ms

# At this point Chrome sends message with streamId 256 (it may be different) which is mapped to 256 in the SFU sender side.
Since the SFU has negotiated OS:256 with FF 69, the `usrsctp_sendv()` of the SFU fails:

mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:256, SSN:0, TSN:465916928, PPID:51, context:0, flags:8] +111ms
mediasoup:WARN:worker[pid:37260] RTC::SctpAssociation::SendSctpMessage() | error sending SCTP message [sid:256, ppid:51, len:17]: Invalid argument +0ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:257, SSN:0, TSN:465916929, PPID:51, context:0, flags:8] +109ms
mediasoup:WARN:worker[pid:37260] RTC::SctpAssociation::SendSctpMessage() | error sending SCTP message [sid:257, ppid:51, len:17]: Invalid argument +110ms

# Then Firefox 69 sends SCTP_ADD_STREAMS to increase its OS in 16 more streams. The SFU reacts by also incrementing its OS in 16:

mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:272, out:256, flags:0] +42ms
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | adding 16 outgoing streams +0ms
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:272, out:272, flags:0] +2ms

# So new messages can be sent to Firefox 69:

mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:258, SSN:0, TSN:465916930, PPID:51, context:0, flags:8] +80ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:259, SSN:0, TSN:465916931, PPID:51, context:0, flags:8] +116ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:260, SSN:0, TSN:465916932, PPID:51, context:0, flags:8] +122ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:261, SSN:0, TSN:465916933, PPID:51, context:0, flags:8] +118ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:262, SSN:0, TSN:465916934, PPID:51, context:0, flags:8] +121ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:263, SSN:0, TSN:465916935, PPID:51, context:0, flags:8] +129ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:264, SSN:0, TSN:465916936, PPID:51, context:0, flags:8] +113ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:265, SSN:0, TSN:465916937, PPID:51, context:0, flags:8] +178ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:266, SSN:0, TSN:465916938, PPID:51, context:0, flags:8] +139ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:267, SSN:0, TSN:465916939, PPID:51, context:0, flags:8] +365ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:268, SSN:0, TSN:465916940, PPID:51, context:0, flags:8] +315ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:269, SSN:0, TSN:465916941, PPID:51, context:0, flags:8] +125ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:270, SSN:0, TSN:465916942, PPID:51, context:0, flags:8] +126ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:271, SSN:0, TSN:465916943, PPID:51, context:0, flags:8] +143ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:272, SSN:0, TSN:465916944, PPID:51, context:0, flags:8] +118ms

# Same process every 16 new streamIds:

mediasoup:WARN:worker[pid:37260] RTC::SctpAssociation::SendSctpMessage() | error sending SCTP message [sid:272, ppid:51, len:17]: Invalid argument +2s
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:288, out:272, flags:0] +86ms
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | adding 16 outgoing streams +1ms
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:288, out:288, flags:0] +2ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:273, SSN:0, TSN:465916945, PPID:51, context:0, flags:8] +54ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:274, SSN:0, TSN:465916946, PPID:51, context:0, flags:8] +123ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:275, SSN:0, TSN:465916947, PPID:51, context:0, flags:8] +116ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:276, SSN:0, TSN:465916948, PPID:51, context:0, flags:8] +130ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:277, SSN:0, TSN:465916949, PPID:51, context:0, flags:8] +131ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:278, SSN:0, TSN:465916950, PPID:51, context:0, flags:8] +118ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:279, SSN:0, TSN:465916951, PPID:51, context:0, flags:8] +160ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:280, SSN:0, TSN:465916952, PPID:51, context:0, flags:8] +112ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:281, SSN:0, TSN:465916953, PPID:51, context:0, flags:8] +219ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:282, SSN:0, TSN:465916954, PPID:51, context:0, flags:8] +183ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:283, SSN:0, TSN:465916955, PPID:51, context:0, flags:8] +136ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:284, SSN:0, TSN:465916956, PPID:51, context:0, flags:8] +127ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:285, SSN:0, TSN:465916957, PPID:51, context:0, flags:8] +125ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:286, SSN:0, TSN:465916958, PPID:51, context:0, flags:8] +203ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:287, SSN:0, TSN:465916959, PPID:51, context:0, flags:8] +133ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:288, SSN:0, TSN:465916960, PPID:51, context:0, flags:8] +117ms
mediasoup:WARN:worker[pid:37260] RTC::SctpAssociation::SendSctpMessage() | error sending SCTP message [sid:288, ppid:51, len:17]: Invalid argument +2s
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:304, out:288, flags:0] +105ms
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | adding 16 outgoing streams +0ms
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:304, out:304, flags:0] +1ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:289, SSN:0, TSN:465916961, PPID:51, context:0, flags:8] +67ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:290, SSN:0, TSN:465916962, PPID:51, context:0, flags:8] +151ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:291, SSN:0, TSN:465916963, PPID:51, context:0, flags:8] +188ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:292, SSN:0, TSN:465916964, PPID:51, context:0, flags:8] +136ms

# etc etc

IMHO FF 69 behaves better without the path. Without the patch, FF 69 sends the SCTP_ADD_STREAMS when just 16 unused streamIds left. However, with the patch, it looks like FF sends such a SCTP_ADD_STREAMS when it really runs out of streamIds. NOTE: This may be caused due the high CPU if FF in the test above, not sure.

(In reply to Iñaki Baz Castillo from comment #51)

We should be renegotiating that limit every time it is exceeded. Can you provide a jsfiddle that demonstrates this problem on the binary from try?

I cannot provide any jsfiddle because as I explained every depends on how the remote endpoint negotiates its OS and MIS in its SCPT INIT.

BTW: yes, FF 69 sends SCTP_ADD_STREAMS every 16 new DataChannels. When the peer is FF 69, our SFU initiates its SCTP INIT with OS:16 and MIS:2048. And we have added code to react on the SCTP_STREAM_CHANGE_EVENT (caused by FF sending a SCTP_ADD_STREAMS(16) to our SFU). In that code our SFU reacts by also sending a SCTP_ADD_STREAMS(N) to FF 69.

In FF 67 this is different. If our SFU sends SCTP INIT with OS:16 and MIS:2048, Firefox 67 does not later send proper SCTP_ADD_STREAMS and cannot handle 2048 opened DataChannels.

I can't say I am terribly surprised by this.

So, depending on the OS in the SCTP INIT of the remote endpoint (our SFU in this case) different versions of Firefox behave different and wrongly in same cases.

Said that, you can test our environment with Chrome, FF 67 and FF 69 as follows:

  • Enter in https://test.mediasoup.org/?roomId=foo with two browsers.
  • In one of them, open the console and type __testSctp().
  • Lot of DataChannels will be created in both browsers and messages with the outgoing SCTP stream will be sent from that browser to the SFU and from the SFU to the other browser.

Thanks, I'll try this.

As a very funny fact, if you do that test with two tabs of FF 69, then the receiver one will NOT send SCTP_ADD_STREAMS so won't be able to create and connect more than 32 DataChannels. That does not happens if there is a single FF tab doing the test in any direction, so it seems that Firefox has some kind of global limit.

Did you observe this on nightly, or one of the try builds? I would suppose this is a problem with Firefox interoping with itself (as opposed to your server) in the test you've written, and not a global limit.

(In reply to Iñaki Baz Castillo from comment #52)

Does the patch make FF behave as FF 67 in which the browser starts with OS 256 and adds 16 more outgoing streams once those are close to be exhausted?
That's the intent, and what others in this bug are seeing. Hopefully it will for you also, but we shall see!

Thanks, Byron. Ok, this is what I see by testing Firefox NightlyDebug with the patch:

# mediasoup (our SFU) initiates its SCTP with OS:256 and MIS:2048 when the remote is Firefox NightlyDebug 69 (with the patch applied).
#
# For each different streamId below, both Chrome (sender) and FF 69 (receiver) create a negotiated:true DataChannel.
# Every received SCTP message below has been sent by Chrome. The SFU sends them to FF 69 also using incremental streamIds.


# This is the log of the SCTP initialization with FF 69:

mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP association connected, streams [in:256, out:256]
mediasoup-demo-server:INFO:Room WebRtcTransport "sctpstatechange" event [sctpState:connected] +285ms


# Chrome starts creating negotiated DataChannels and sending messages with incremental streamId (one for each DC).
# Firefox creates a negotiated DataChannel for each of them:

mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:15, streamId:2, SSN:0, TSN:465916674, PPID:51, context:0, flags:8] +8s
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:15, streamId:3, SSN:0, TSN:465916675, PPID:51, context:0, flags:8] +143ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:15, streamId:4, SSN:0, TSN:465916676, PPID:51, context:0, flags:8] +239ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:15, streamId:5, SSN:0, TSN:465916677, PPID:51, context:0, flags:8] +117ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:15, streamId:6, SSN:0, TSN:465916678, PPID:51, context:0, flags:8] +112ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:15, streamId:7, SSN:0, TSN:465916679, PPID:51, context:0, flags:8] +150ms

# etc etc, more messages for streamId++...

mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:250, SSN:0, TSN:465916922, PPID:51, context:0, flags:8] +112ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:251, SSN:0, TSN:465916923, PPID:51, context:0, flags:8] +121ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:252, SSN:0, TSN:465916924, PPID:51, context:0, flags:8] +134ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:253, SSN:0, TSN:465916925, PPID:51, context:0, flags:8] +111ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:254, SSN:0, TSN:465916926, PPID:51, context:0, flags:8] +114ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:255, SSN:0, TSN:465916927, PPID:51, context:0, flags:8] +117ms

# At this point Chrome sends message with streamId 256 (it may be different) which is mapped to 256 in the SFU sender side.
Since the SFU has negotiated OS:256 with FF 69, the `usrsctp_sendv()` of the SFU fails:

mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:256, SSN:0, TSN:465916928, PPID:51, context:0, flags:8] +111ms
mediasoup:WARN:worker[pid:37260] RTC::SctpAssociation::SendSctpMessage() | error sending SCTP message [sid:256, ppid:51, len:17]: Invalid argument +0ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:257, SSN:0, TSN:465916929, PPID:51, context:0, flags:8] +109ms
mediasoup:WARN:worker[pid:37260] RTC::SctpAssociation::SendSctpMessage() | error sending SCTP message [sid:257, ppid:51, len:17]: Invalid argument +110ms

# Then Firefox 69 sends SCTP_ADD_STREAMS to increase its OS in 16 more streams. The SFU reacts by also incrementing its OS in 16:

mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:272, out:256, flags:0] +42ms
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | adding 16 outgoing streams +0ms
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:272, out:272, flags:0] +2ms

# So new messages can be sent to Firefox 69:

mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:258, SSN:0, TSN:465916930, PPID:51, context:0, flags:8] +80ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:259, SSN:0, TSN:465916931, PPID:51, context:0, flags:8] +116ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:260, SSN:0, TSN:465916932, PPID:51, context:0, flags:8] +122ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:261, SSN:0, TSN:465916933, PPID:51, context:0, flags:8] +118ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:262, SSN:0, TSN:465916934, PPID:51, context:0, flags:8] +121ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:263, SSN:0, TSN:465916935, PPID:51, context:0, flags:8] +129ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:264, SSN:0, TSN:465916936, PPID:51, context:0, flags:8] +113ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:265, SSN:0, TSN:465916937, PPID:51, context:0, flags:8] +178ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:266, SSN:0, TSN:465916938, PPID:51, context:0, flags:8] +139ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:267, SSN:0, TSN:465916939, PPID:51, context:0, flags:8] +365ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:268, SSN:0, TSN:465916940, PPID:51, context:0, flags:8] +315ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:269, SSN:0, TSN:465916941, PPID:51, context:0, flags:8] +125ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:270, SSN:0, TSN:465916942, PPID:51, context:0, flags:8] +126ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:271, SSN:0, TSN:465916943, PPID:51, context:0, flags:8] +143ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:272, SSN:0, TSN:465916944, PPID:51, context:0, flags:8] +118ms

# Same process every 16 new streamIds:

mediasoup:WARN:worker[pid:37260] RTC::SctpAssociation::SendSctpMessage() | error sending SCTP message [sid:272, ppid:51, len:17]: Invalid argument +2s
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:288, out:272, flags:0] +86ms
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | adding 16 outgoing streams +1ms
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:288, out:288, flags:0] +2ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:273, SSN:0, TSN:465916945, PPID:51, context:0, flags:8] +54ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:274, SSN:0, TSN:465916946, PPID:51, context:0, flags:8] +123ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:275, SSN:0, TSN:465916947, PPID:51, context:0, flags:8] +116ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:276, SSN:0, TSN:465916948, PPID:51, context:0, flags:8] +130ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:277, SSN:0, TSN:465916949, PPID:51, context:0, flags:8] +131ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:278, SSN:0, TSN:465916950, PPID:51, context:0, flags:8] +118ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:279, SSN:0, TSN:465916951, PPID:51, context:0, flags:8] +160ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:280, SSN:0, TSN:465916952, PPID:51, context:0, flags:8] +112ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:281, SSN:0, TSN:465916953, PPID:51, context:0, flags:8] +219ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:282, SSN:0, TSN:465916954, PPID:51, context:0, flags:8] +183ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:283, SSN:0, TSN:465916955, PPID:51, context:0, flags:8] +136ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:284, SSN:0, TSN:465916956, PPID:51, context:0, flags:8] +127ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:285, SSN:0, TSN:465916957, PPID:51, context:0, flags:8] +125ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:286, SSN:0, TSN:465916958, PPID:51, context:0, flags:8] +203ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:287, SSN:0, TSN:465916959, PPID:51, context:0, flags:8] +133ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:288, SSN:0, TSN:465916960, PPID:51, context:0, flags:8] +117ms
mediasoup:WARN:worker[pid:37260] RTC::SctpAssociation::SendSctpMessage() | error sending SCTP message [sid:288, ppid:51, len:17]: Invalid argument +2s
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:304, out:288, flags:0] +105ms
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | adding 16 outgoing streams +0ms
mediasoup:worker RTC::SctpAssociation::OnUsrSctpReceiveSctpNotification() | SCTP stream changed, streams [in:304, out:304, flags:0] +1ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:289, SSN:0, TSN:465916961, PPID:51, context:0, flags:8] +67ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:290, SSN:0, TSN:465916962, PPID:51, context:0, flags:8] +151ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:291, SSN:0, TSN:465916963, PPID:51, context:0, flags:8] +188ms
mediasoup:worker RTC::SctpAssociation::onRecvSctpData() | data chunk received [length:17, streamId:292, SSN:0, TSN:465916964, PPID:51, context:0, flags:8] +136ms

# etc etc

IMHO FF 69 behaves better without the path. Without the patch, FF 69 sends the SCTP_ADD_STREAMS when just 16 unused streamIds left. However, with the patch, it looks like FF sends such a SCTP_ADD_STREAMS when it really runs out of streamIds. NOTE: This may be caused due the high CPU if FF in the test above, not sure.

That's the intended behavior. I guess we could try to preserve some headroom to speed up negotiation as an optimization, but it is unclear to me whether this will be of any help in practice given that the initial number should be 256 again.

Also, let's not get distracted from the fact that the patch fixes a show stopper bug that broke data channel interop with Chrome/libwebrtc.

As a very funny fact, if you do that test with two tabs of FF 69, then the receiver one will NOT send SCTP_ADD_STREAMS so won't be able to create and connect more than 32 DataChannels. That does not happens if there is a single FF tab doing the test in any direction, so it seems that Firefox has some kind of global limit.

Did you observe this on nightly, or one of the try builds? I would suppose this is a problem with Firefox interoping with itself (as opposed to your server) in the test you've written, and not a global limit.

In fact this happens with my test, this is wtih Firefox_1 sending lot of messages to the SFU via different streamIds, and the SFU forwarding them to Firefox_2. However, if instead of two Firefox there is Firefox in Chrome, no matter who runs the test, Firefox works fine. This is in Nightly with and without the patch.

That's the intended behavior. I guess we could try to preserve some headroom to speed up negotiation as an optimization, but it is unclear to me whether this will be of any help in practice given that the initial number should be 256 again.

It's ok IMHO. It may just happen that some initial messages are lost due to the lack of OS in the sender. Not important.

Pushed by bcampen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9a84014bc0e9
Set mNegotiatedIdLimit when the channel opens, and remove some code that was stomping our initially chosen max streams. r=mjf
Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70

Please nominate this for Beta uplift when you get a chance. Also, what sort of testing does this fix need? Doesn't look like we added any new automated tests? Do we need at least a QA pass?

Flags: needinfo?(docfaraday)

Comment on attachment 9075201 [details]
Bug 1562341: Set mNegotiatedIdLimit when the channel opens, and remove some code that was stomping our initially chosen max streams. r?mjf

Beta/Release Uplift Approval Request

  • User impact if declined: DataChannel interop problems with non-browser endpoints.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): This code has poor automated testing coverage, and it is very difficult to write such code. We think that this works better now, and the code is simpler, but we may yet find new interop problems.
  • String changes made/needed: None.
Flags: needinfo?(docfaraday)
Attachment #9075201 - Flags: approval-mozilla-beta?

Comment on attachment 9075201 [details]
Bug 1562341: Set mNegotiatedIdLimit when the channel opens, and remove some code that was stomping our initially chosen max streams. r?mjf

Fixes a WebRTC regression in Fx69. Approved for 69.0b5.

Attachment #9075201 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.