Open Bug 1211097 Opened 9 years ago Updated 2 years ago

SDP Answer always advertises 256 streams per data channel

Categories

(Core :: WebRTC: Signaling, defect, P3)


Tracking


Tracking Status
firefox44 --- affected

People

(Reporter: drno, Unassigned)

Details

(Keywords: dev-doc-needed, Whiteboard: memshrink)

From looking at https://bug1207056.bmoattachments.org/attachment.cgi?id=8667133 from bug 1207056 it appears that we always answer with 256 streams in the sctpmap, no matter what was offered. What is potentially harmful is that we seem to actually use the value from the offer to initialize our SCTP stack (see bug 1211091). So a pedantic offerer that offered 1024 streams could assume from our answer that we only want to do 256 streams, while Firefox internally sets up the 1024 streams requested in the offer.
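For illustration (made-up port and stream values, using the old sctpmap syntax we currently emit), the mismatch would look like this:

   Offer:   m=application 9 DTLS/SCTP 5000
            a=sctpmap:5000 webrtc-datachannel 1024

   Answer:  m=application 9 DTLS/SCTP 5000
            a=sctpmap:5000 webrtc-datachannel 256

while internally we would initialize the SCTP stack with the offered 1024 streams.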
backlog: --- → webrtc/webaudio+
Rank: 25
Priority: -- → P2
I have a new unit test which verifies my suspicion: Firefox always replies with 256 streams in the sctpmap, no matter what was offered.
After reading the old draft and the current draft (which no longer has a parameter for the number of streams at all), I'm not sure the number of streams is actually something which needs to be agreed on.

In any case, claiming that we do 256 streams while in reality doing something different should get fixed.
But if both sides don't have to agree on the same number of streams, the fix for this problem could simply be:
A) we always use 256 streams, as we already advertise in our SDP
B) we always use the current maximum of 2048 and advertise that properly in our SDP

Randell, Michael, any thoughts on this?
Flags: needinfo?(rjesup)
The second paragraph of https://tools.ietf.org/html/draft-ietf-rtcweb-data-channel-13#section-6.2 reads:

   The number of streams negotiated during SCTP association setup SHOULD
   be 65535, which is the maximum number of streams that can be
   negotiated during the association setup.

I think that is the number that should be used in the code and in SDP.
(In reply to Michael Tüxen from comment #3)
> The second paragraph of
> https://tools.ietf.org/html/draft-ietf-rtcweb-data-channel-13#section-6.2
> reads:
> 
>    The number of streams negotiated during SCTP association setup SHOULD
>    be 65535, which is the maximum number of streams that can be
>    negotiated during the association setup.
> 
> I think that is the number that should be used in the code and in SDP.

The downside is a not-insignificant memory hit (especially for mobile) to allocate 65K channels on each PeerConnection that uses DataChannels.  It's at least a pointer per channel (so 256K/512K on 32/64-bit), which matters for mobile given that in 99.9% of cases 99.9% of it is unused.  The plan was to make the channels array in DataChannel.h/cpp at least lazily created, and preferably also sparse (so a pre-negotiated channel of 64000 doesn't force a huge, mostly empty array), and to drop the negotiation parameter.  Note that Chrome currently uses 2048 as well.

One option would be to create (at the protocol level) 65K channels, but indicate 2048 in the SDP.  However, we need to check what happens at the code level when this is done, as we still have code for reconfiguring the number of channels automatically, which Chrome doesn't support.  This ability may be the reason for the current ugly state of the code; you should check the history of the bugs that led to the current state.  Older Mozillas used 16 as the default; we bumped it to 256 when we fixed a reconfigure bug, and moved to the current state IIRC due to a bug in FF/Chrome interop.

In any case, this absolutely needs cleanup (it was on my shortlist).  The preferred end state would be to match the spec (SDP and DataChannels drafts): no channels in SDP, 65K configured at the SCTP level, and low memory usage (preferably making the internal SCTP channel array dynamic as well; IIRC the stack has at least a pointer per stream, as does ours).  I suspect the memory hit is why Chrome uses 2048.
Flags: needinfo?(rjesup)
Yepp, the memory footprint is exactly the issue I mentioned when we discussed using the stream reconfiguration option to increase the number of streams. The consensus in the IETF WG was that it is simpler to deal with 64K streams than to use the protocol mechanism to increase the number.

Using a thin array in the SCTP stack is on the ToDo list, but not at the top...
Yes, and I argued for quite a while for reconfig support (which we have).  It does add complexity and did cause one bug.  Michael: can you verify the per-configured input+output stream memory overhead in the stack?  IIRC in our implementation, in the code on top of the stack, it's one pointer per supported stream.  For added fun, I think allocations of exactly power-of-two sizes (or very close) get bumped up to double the request in jemalloc, so 65535 pointers may end up using 2x the memory actually requested.  This effect would hit the SCTP stack's allocations as well, but just report what you'll request and I'll figure out the resultant impact.
Flags: needinfo?(tuexen)
Hi Randell,

since pointers are involved, the results are different on 32-bit and 64-bit platforms.
For 64-bit platforms we currently use:
* 24 bytes for each incoming stream. This will grow when we deploy I-DATA.
* 64 bytes for each outgoing stream if you don't have detailed statistics. This will grow when we implement support for bufferedAmount and bufferedAmountLowThreshold.

To limit the memory impact, I could implement it in a way that each stream unconditionally costs one pointer (8 bytes), with late allocation of the real data. So a stream would always cost 8 bytes, and when used it would cost 8 bytes + 64 bytes. This would not free the memory before the association is removed, but it would improve things. What do you think?
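To make it concrete, roughly like this (a minimal sketch with illustrative names, not the actual usrsctp structures):

#include <stdint.h>
#include <stdlib.h>

struct out_stream_data {
  uint32_t next_ssn;   /* stand-in for the ~64 bytes of real state */
};

/* Allocated at association setup, one pointer (8 bytes) per stream:
 *   out_streams = calloc(num_ostreams, sizeof(*out_streams)); */
static struct out_stream_data **out_streams;

static struct out_stream_data *
get_out_stream(uint16_t sid)
{
  if (out_streams[sid] == NULL)
    /* first use: late allocation of the real data (+64 bytes) */
    out_streams[sid] = calloc(1, sizeof(struct out_stream_data));
  return out_streams[sid];   /* NULL if the allocation failed */
}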
Flags: needinfo?(tuexen)
> For 64-bit platforms we currently use:
> * 24 bytes for each incoming stream. This will grow when we deploy I-DATA.
> * 64 bytes for each outgoing stream if you don't have detailed statistics.
> This will grow when we implement support for bufferedAmount and
> bufferedAmountLowThreshold.
> 
> To limit the memory impact, I could implement it in a way that each stream
> unconditionally costs one pointer (8 bytes), with late allocation of the
> real data. So a stream would always cost 8 bytes, and when used it would
> cost 8 bytes + 64 bytes. This would not free the memory before the
> association is removed, but it would improve things. What do you think?

With those costs (even if a little better in 32-bit), we DEFINITELY want to use late and/or sparse allocation of stream data.

For 2048 streams, we have 2048*8 bytes of DataChannel overhead (16KB), plus 88*2048 = 176K in the stack.  64K channels would be 512KB in DataChannels, and ~5.5MB for the SCTP stack.

There are several options, ordered from largest memory use to most memory-efficient (in most cases):
* Direct arrays of structures (currently what SCTP uses)
* Arrays of pointers to structures, lazily allocated (what DataChannel.cpp currently uses)
* Expandable array of structures (reallocate, maybe in 16-stream chunks, to include highest channel used)
* Expandable array of pointers to structures (ditto, but deals with 'sparse' usage better - apps that don't allocate channels/streams starting at/near 0)
* Sparse array of structures.  (Requires considerably more logic and may be slower or require storing them in a hashtable)
* Sparse array of pointers to structures (ditto, but may reduce the overhead of walking the array and of alloc/free by allowing the pointers to be allocated in N-stream/channel chunks, since it's just a block of pointers).

In general, CPU use increases as you go down the list (as memory use declines).  Not always, though: hashtables might be reasonably efficient in practice unless a huge number of channels are active at once, especially if the hashtable moves the most recently retrieved item to the front of its collision list or applies other optimizations.  Note that the hash implementation will eat some memory too.
Whiteboard: memshrink
Yepp, you are right, the first option is what SCTP currently uses. But it is under the control of the application: if the application increases the number of streams, we reallocate the memory. The problem with WebRTC is that the application always wants the maximum number of streams.
Since I prefer to keep the kernel and userland stacks in sync, I don't want to do the sparse stuff, since everyone would pay a penalty. I guess late/lazy allocation should be doable. We would need to fail the send call with ENOBUFS if the late memory allocation fails... This applies to outgoing streams. For incoming streams, I'm not sure this is appropriate. Dropping the TSNs for which I can't allocate the instream structure... Not sure.
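For outgoing streams, the send path would then look roughly like this (sketch only, illustrative names, reusing get_out_stream() from my sketch above):

#include <errno.h>
#include <stdint.h>
#include <stddef.h>

static int
send_on_stream(uint16_t sid, const void *buf, size_t len)
{
  struct out_stream_data *so = get_out_stream(sid);
  if (so == NULL) {
    errno = ENOBUFS;   /* late allocation failed */
    return -1;
  }
  /* ... queue buf/len on the stream ... */
  (void)buf; (void)len;
  return 0;
}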
Sigh.  And now we're stuck with it.  Ugly as it may be, we may want to reopen this on the rtcweb list.  Or do *something* in the stack to reduce the pain:

#if KERNEL
  /* kernel: allocate the full array up front, as today */
  stream_array = calloc(max_streams, sizeof(stream_array[0]));

void ensure_stream(uint16_t id) {}
#else
  /* userland: start small and grow on demand */
  stream_array = calloc(min_streams, sizeof(stream_array[0]));
  stream_array_size = min_streams;

void ensure_stream(uint16_t id)
{
  if (id >= stream_array_size) {
    size_t new_size = MIN(id + STREAM_ALLOC_INCREMENT, max_streams);
    struct stream *new_stream_array = calloc(new_size, sizeof(stream_array[0]));
    assert(new_stream_array);
    memcpy(new_stream_array, stream_array,
           stream_array_size * sizeof(stream_array[0]));
    free(stream_array);
    stream_array = new_stream_array;
    stream_array_size = new_size;
  }
}
#endif
Well, the point in the WG discussion was that one CAN implement this with optimal memory efficiency (as you mention above). However, this might not be a good choice for all SCTP implementations, only for userland stacks used for WebRTC.
Let's see if we can find an acceptable compromise for the usrsctp stack, which shares a lot of code with the FreeBSD kernel sources; I would like to keep the differences between the sources as small as possible.
So, is it acceptable to unconditionally allocate a single pointer per outgoing stream? The actual structures would only be allocated when the outgoing stream is used (i.e., the first time you send a message on it).
If that is not acceptable, we would need another layer of indirection, which I guess would result in a performance loss.
What do you think?
Flags: needinfo?(rjesup)
Pointer-per-possible-stream would mean 64K*sizeof(ptr) = 256 or 512K.  256K per PeerConnection-with-DataChannels is too big a cost for background uses of DataChannels on mobile, especially as 99.9% of it won't be used.

If the default number of streams were 2K, it would be 8 or 16K per connection, which while annoying is not horrible.  The overhead would be on the order of one extra pointer indirection, and in many cases the pointer may not need to be re-fetched.

I'd lean towards the extra pointer fetch (for usrsctp).
Flags: needinfo?(rjesup)
see comment 12
Flags: needinfo?(tuexen)
Not sure what you want...

So you want a single pointer to an array of pointers to the streams? This array could be dynamic and contain the pointers for streams 0, ..., n-1. One would try to keep n as low as possible. However, if the application requests stream 65534, you would allocate all of them.
Is this what you are referring to?
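I.e., something like this (sketch with illustrative names; the per-stream structures would still be allocated lazily as above):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct out_stream_data;                        /* allocated lazily on first use */

static struct out_stream_data **stream_ptrs;   /* pointers for streams 0..n-1 */
static uint32_t nstream_ptrs;                  /* current n */

/* Grow the pointer array so that stream 'sid' has a slot. */
static int
ensure_stream_ptr(uint16_t sid)
{
  if (sid >= nstream_ptrs) {
    uint32_t n = (uint32_t)sid + 1;
    struct out_stream_data **tmp = realloc(stream_ptrs, n * sizeof(*tmp));
    if (tmp == NULL)
      return -1;
    /* zero the newly added slots */
    memset(tmp + nstream_ptrs, 0, (n - nstream_ptrs) * sizeof(*tmp));
    stream_ptrs = tmp;
    nstream_ptrs = n;
  }
  return 0;
}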
Flags: needinfo?(tuexen) → needinfo?(rjesup)
Mass change P2->P3 to align with new Mozilla triage process.
Priority: P2 → P3
I no longer have the capacity to work on this any time soon.
Assignee: drno → nobody

Clearing old needinfo as part of triage.

Flags: needinfo?(rjesup)
Severity: normal → S3