Closed Bug 1988096 Opened 4 months ago Closed 3 months ago

Sending data over an RTCDataChannel sometimes fails for an operation-specific reason

Categories

(Core :: WebRTC, defect)

Firefox 140
defect

Tracking


RESOLVED FIXED
146 Branch
Tracking Status
firefox146 --- fixed

People

(Reporter: alex, Assigned: bwc)

References

(Blocks 1 open bug)

Details

Attachments

(11 files)

Attached file about-webrtc.txt

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36

Steps to reproduce:

I create two RTCPeerConnection objects in the same browser process. The initiating peer creates a datachannel, they exchange SDP offers/answers/ICE candidates until both have a connectionState of "connected".

The initiating peer closes the first datachannel and opens a second.

The receiving peer stores a reference to the second incoming datachannel to ensure it is not garbage collected and registers a "message" event listener.

The initiating peer waits for the channel's readyState to be "open" and writes some data into it.
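Roughly, the setup looks like this (a simplified sketch assuming an async context; for brevity it waits for ICE gathering to complete instead of trickling candidates, and the helper names are mine, not from the actual application):

```js
const initiator = new RTCPeerConnection();
const receiver = new RTCPeerConnection();

// Wait for ICE gathering to finish so candidates are embedded in the SDP.
const gatherComplete = pc => new Promise(resolve => {
  if (pc.iceGatheringState === "complete") return resolve();
  pc.onicegatheringstatechange = () =>
    pc.iceGatheringState === "complete" && resolve();
});

const waitForOpen = dc => new Promise(resolve =>
  dc.addEventListener("open", resolve, { once: true }));

let incoming; // keep a reference to the incoming channel so it isn't GC'd
receiver.ondatachannel = ({ channel }) => {
  incoming = channel;
  channel.onmessage = e => console.log("received:", e.data);
};

const first = initiator.createDataChannel("first");
await initiator.setLocalDescription(await initiator.createOffer());
await gatherComplete(initiator);
await receiver.setRemoteDescription(initiator.localDescription);
await receiver.setLocalDescription(await receiver.createAnswer());
await gatherComplete(receiver);
await initiator.setRemoteDescription(receiver.localDescription);

await waitForOpen(first);
first.close();                                // close the first channel...
const second = initiator.createDataChannel("second");
await waitForOpen(second);
second.send("hello");                         // ...this send intermittently throws
```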

Actual results:

Calling .send intermittently throws a DOMException with the error message "The operation failed for an operation-specific reason" and a .code of 0.

The "close" event fires on the datachannel shortly afterwards though the peer connection retains a connectionState of "connected".

Expected results:

The .send method should not throw and the remote peer should receive the data.

The same code works in Chrome without any errors.

The connection log from about:webrtc during the test run is attached.

From what I can see, I have two peer connections: the receiver (d1ae5ae4-616c-4119-8fde-345dc93878fc) and the initiator (97cd2ee9-b0e9-47ec-9932-5a307b33ca3a). I can't see any obvious errors in the log, though maybe I'm missing something.

Is there a way to find out what the "operation-specific" reason was?

The Bugbug bot thinks this bug should belong to the 'Core::WebRTC' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → WebRTC
Product: Firefox → Core

Thanks for the report! That error (in 140) comes from the SendDataMsgCommon return value being anything other than 0 or EMSGSIZE. Looking at Nightly, the plumbing has shifted a bit here but I think the same logic applies. Might be worth a try in Nightly, though.

If Nightly exhibits the same behavior, I think the easiest way forward would be if you could provide a minimal example for reproducing this, that we can debug. Please attach a html file that shows the issue, or link a jsfiddle or similar to the same effect.

Flags: needinfo?(alex)

I'm trying to isolate the code into a runnable example, but it's proving very hard to trigger the same behaviour without the WebRTC code running in the context of a larger application.

I think the problem may be that the datachannel IDs become eligible for reuse immediately after .close is called on a channel?

What seems to be happening is that one side opens a datachannel, waits for the 'open' event, writes several pieces of data into it and closes it. It then does this again immediately.

Could it be that the remote peer sends confirmation of the first closure while the local peer is in the middle of writing into the second channel (which has the same ID as the first)?

Notably, if I wait for the "close" event after closing the first channel but before opening a new one, the "operation-specific reason" error seems to occur less frequently.
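The pattern that seems to trigger it looks roughly like this (sketch; the names and the loop count are illustrative):

```js
// Open a channel, write into it, close it, and immediately repeat.
async function burst(pc, waitForCloseEvent) {
  for (let i = 0; i < 1000; i++) {
    const dc = pc.createDataChannel("burst-" + i);
    await new Promise(r => dc.addEventListener("open", r, { once: true }));
    dc.send("payload " + i);
    dc.close();
    if (waitForCloseEvent) {
      // Workaround that makes the error much rarer: don't open the next
      // channel until this one has fired its close event.
      await new Promise(r => dc.addEventListener("close", r, { once: true }));
    }
  }
}
```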

Flags: needinfo?(alex)

Sets up initiator and receiver peer connections.

The receiver listens for incoming datachannels. When one is opened, it waits for the first message event, then echoes the received data back to the sender and closes the channel.

The initiator opens a channel, sends a message and closes the channel. It then opens a second channel, sends a message and waits for the receiver to close the channel.

It does this in a loop until an error occurs.
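In outline, the reproduction does this (simplified sketch assuming an async context; the real file also wires up the two peer connections and signaling):

```js
// Receiver: echo each message back, then close the channel.
receiver.ondatachannel = ({ channel }) => {
  channel.onmessage = e => {
    channel.send(e.data);
    channel.close();
  };
};

// Initiator: open/send/close a first channel, then open a second, send,
// and wait for the receiver to close it before looping.
async function iteration(n) {
  const first = initiator.createDataChannel(`iter-${n}-first`);
  await new Promise(r => first.addEventListener("open", r, { once: true }));
  first.onmessage = e => console.log("first got:", e.data);
  first.send(`message for first (${n})`);
  first.close(); // leaving this out avoids the cross-delivery, but the loop
                 // eventually stalls (see below)

  const second = initiator.createDataChannel(`iter-${n}-second`);
  await new Promise(r => second.addEventListener("open", r, { once: true }));
  // Cross-delivery shows up here: data meant for one channel arrives on another.
  second.onmessage = e => console.log("second got:", e.data);
  second.send(`message for second (${n})`);
  await new Promise(r => second.addEventListener("close", r, { once: true }));
}

for (let n = 0; ; n++) {
  await iteration(n); // loops until an error occurs
}
```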

In Firefox Nightly 144.0a1 this runs a couple of iterations, then the second datachannel receives a message that was sent to the first datachannel.

This appears to be because both the initiator and the receiver close the datachannel. If line 154 of the attachment is commented out (i.e. the initiator does not close the channel), the messages always arrive at the correct datachannel, but the loop eventually grinds to a halt after 10-40k iterations, when it should run forever given that both datachannels are closed before the loop continues.

I've added a reproduction file. It doesn't show this problem directly (i.e. it doesn't throw a "The operation failed for an operation-specific reason" error), but it does show that sometimes datachannels will receive messages sent to other datachannels if previously the same datachannel was closed by both ends of the connection. Should I open a new bug for this?

> if previously the same datachannel was closed by both ends of the connection

I mean if a previous datachannel was closed by both ends of the connection. Sorry for the confusion, I can't edit my comments here.

I've opened https://bugzilla.mozilla.org/show_bug.cgi?id=1988454 as I'm not sure these two problems are related.

The "operation-specific reason" error only seems to happen when the channel IDs are the same, the "wrong channel delivery" can happen with different IDs and is probably serious enough to warrant tracking it separately.

See Also: → 1988454

Byron, could you take a look? (both here and bug 1988454 ideally)

Flags: needinfo?(docfaraday)

I think I see what's happening here, but there's a spec wrinkle: selecting an already-in-use id when calling createDataChannel does not cause an error according to the spec, even though it cannot work and is clearly invalid. That's really strange, and it means we can't test this fully in wpt right now. I think I can at least test that this specific bug does not occur, though.
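For illustration, something like this is allowed without an error as far as I can tell from the spec (sketch; the names and the negotiated-channel approach are just for demonstration, and actual browser behavior may differ):

```js
const pc = new RTCPeerConnection();
const a = pc.createDataChannel("a", { negotiated: true, id: 1 });
// Per my reading of the spec, this does not throw even though id 1 is
// already in use, despite the fact that the second channel can never work.
const b = pc.createDataChannel("b", { negotiated: true, id: 1 });
console.log(a.id, b.id); // both claim id 1
```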

Assignee: nobody → docfaraday
Flags: needinfo?(docfaraday)

I was hoping that I could get a try run today, but there was some infra bustage that ate my pushes. Trying again...

https://treeherder.mozilla.org/jobs?repo=try&landoCommitID=153845
https://treeherder.mozilla.org/jobs?repo=try&landoCommitID=153846

The severity field is not set for this bug.
:mjf, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(mfroman)

Setting to S2 for now.

Severity: -- → S2
Flags: needinfo?(mfroman)

Also, make sure that we don't fire close events until streams have been reset
in both directions.

Depends on D269062

Depends on D269063

Add some missing test case cleanup, mark a test as long, use promise_test instead of async_test in one place.

Depends on D269064

Depends on D269065

This helps ensure that these runnables (and all of their lambda captures)
aren't leaked during shutdown.

Depends on D269066

Mostly this is logging crucial lifecycle events at INFO, not DEBUG.

Depends on D269067

Depends on D269069

Pushed by bcampen@mozilla.com:
https://github.com/mozilla-firefox/firefox/commit/03725a4aff91
https://hg.mozilla.org/integration/autoland/rev/2c7b813bc96c
Test that ids are reusable as soon as the close event fires. r=jib
https://github.com/mozilla-firefox/firefox/commit/e2d169db608a
https://hg.mozilla.org/integration/autoland/rev/eac802dc63b4
Track whether stream ids are in use on a per-direction basis. r=ng
https://github.com/mozilla-firefox/firefox/commit/b913a5c92a97
https://hg.mozilla.org/integration/autoland/rev/bfacca5f1eec
Use labels in these DataChannel tests. r=jib
https://github.com/mozilla-firefox/firefox/commit/c8fc75101bdc
https://hg.mozilla.org/integration/autoland/rev/e686cc99b7e5
Miscellaneous test cleanup. r=jib
https://github.com/mozilla-firefox/firefox/commit/f6fd8610b5e3
https://hg.mozilla.org/integration/autoland/rev/738a2847ad05
Make ResetStreams fallible. r=ng
https://github.com/mozilla-firefox/firefox/commit/e0e338611430
https://hg.mozilla.org/integration/autoland/rev/ce0c3db06950
Use cancelable runnables, and fallible dispatch. r=ng
https://github.com/mozilla-firefox/firefox/commit/8ca8d1d12979
https://hg.mozilla.org/integration/autoland/rev/d48fbf91af28
Logging improvements. r=ng
https://github.com/mozilla-firefox/firefox/commit/4577d758859d
https://hg.mozilla.org/integration/autoland/rev/eaae699d5b93
Reduce the number of addrefs/releases to simplify leak debugging. r=ng
https://github.com/mozilla-firefox/firefox/commit/ce76403552ad
https://hg.mozilla.org/integration/autoland/rev/87544e60cda8
Make sure these are only run once. r=ng
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/55745 for changes under testing/web-platform/tests
Upstream PR merged by moz-wptsync-bot
Regressions: 1997294
QA Whiteboard: [qa-triage-done-c147/b146]