Closed Bug 1988096 Opened 4 months ago Closed 3 months ago

Sending data over an RTCDataChannel sometimes fails for an operation-specific reason

Categories

(Core :: WebRTC, defect)

Firefox 140
defect

Tracking


RESOLVED FIXED
146 Branch
Tracking Status
firefox146 --- fixed

People

(Reporter: alex, Assigned: bwc)

References

(Blocks 1 open bug)

Details

Attachments

(11 files)

Attached file about-webrtc.txt

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36

Steps to reproduce:

I create two RTCPeerConnection objects in the same browser process. The initiating peer creates a datachannel, they exchange SDP offers/answers/ICE candidates until both have a connectionState of "connected".

The initiating peer closes the first datachannel and opens a second.

The receiving peer stores a reference to the second incoming datachannel to ensure it is not garbage collected and registers a "message" event listener.

The initiating peer waits for the channel's readyState to be "open" and writes some data into it.
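Roughly, the setup looks like this (a simplified sketch assuming an async context; for brevity it waits for ICE gathering to complete instead of trickling candidates, and the helper names are mine, not from the actual application):

```js
const initiator = new RTCPeerConnection();
const receiver = new RTCPeerConnection();

// Wait for ICE gathering to finish so candidates are embedded in the SDP.
const gatherComplete = pc => new Promise(resolve => {
  if (pc.iceGatheringState === "complete") return resolve();
  pc.onicegatheringstatechange = () =>
    pc.iceGatheringState === "complete" && resolve();
});

const waitForOpen = dc => new Promise(resolve =>
  dc.addEventListener("open", resolve, { once: true }));

let incoming; // keep a reference to the incoming channel so it isn't GC'd
receiver.ondatachannel = ({ channel }) => {
  incoming = channel;
  channel.onmessage = e => console.log("received:", e.data);
};

const first = initiator.createDataChannel("first");
await initiator.setLocalDescription(await initiator.createOffer());
await gatherComplete(initiator);
await receiver.setRemoteDescription(initiator.localDescription);
await receiver.setLocalDescription(await receiver.createAnswer());
await gatherComplete(receiver);
await initiator.setRemoteDescription(receiver.localDescription);

await waitForOpen(first);
first.close();                                // close the first channel...
const second = initiator.createDataChannel("second");
await waitForOpen(second);
second.send("hello");                         // ...this send intermittently throws
```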

Actual results:

Calling .send intermittently throws a DOMException with the error message "The operation failed for an operation-specific reason" and a .code of 0.

The "close" event fires on the datachannel shortly afterwards though the peer connection retains a connectionState of "connected".

Expected results:

The .send method should not throw and the remote peer should receive the data.

The same code works in Chrome without any errors.

The connection log from about:webrtc during the test run is attached.

From what I can see, I have two peer connections: the receiver (d1ae5ae4-616c-4119-8fde-345dc93878fc) and the initiator (97cd2ee9-b0e9-47ec-9932-5a307b33ca3a). I can't see any obvious errors in the log, though maybe I'm missing something.

Is there a way to find out what the "operation-specific" reason was?

The Bugbug bot thinks this bug should belong to the 'Core::WebRTC' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → WebRTC
Product: Firefox → Core

Thanks for the report! That error (in 140) comes from the SendDataMsgCommon return value being anything other than 0 or EMSGSIZE. Looking at Nightly, the plumbing has shifted a bit here but I think the same logic applies. Might be worth a try in Nightly, though.

If Nightly exhibits the same behavior, I think the easiest way forward would be if you could provide a minimal example for reproducing this, that we can debug. Please attach a html file that shows the issue, or link a jsfiddle or similar to the same effect.

Flags: needinfo?(alex)

I'm trying to isolate the code into a runnable example, but it's proving very hard to trigger the same behaviour without the WebRTC code running in the context of a larger application.

I think the problem may be that the datachannel IDs become eligible for reuse immediately after .close is called on a channel?

What seems to be happening is that one side opens a datachannel, waits for the 'open' event, writes several pieces of data into it and closes it. It then does this again immediately.

Could it be that the remote peer sends confirmation of the first closure while the local peer is in the middle of writing into the second channel (which has the same ID as the first)?

Notably, if I wait for the "close" event after closing the first channel but before opening a new one, the "operation-specific reason" error seems to occur less frequently.
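The pattern that seems to trigger it looks roughly like this (sketch; the names and the loop count are illustrative):

```js
// Open a channel, write into it, close it, and immediately repeat.
async function burst(pc, waitForCloseEvent) {
  for (let i = 0; i < 1000; i++) {
    const dc = pc.createDataChannel("burst-" + i);
    await new Promise(r => dc.addEventListener("open", r, { once: true }));
    dc.send("payload " + i);
    dc.close();
    if (waitForCloseEvent) {
      // Workaround that makes the error much rarer: don't open the next
      // channel until this one has fired its close event.
      await new Promise(r => dc.addEventListener("close", r, { once: true }));
    }
  }
}
```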

Flags: needinfo?(alex)

Sets up initiator and receiver peer connections.

The receiver listens for incoming datachannels. When one is opened, it waits for the first message event, then echoes the received data back to the sender and closes the channel.

The initiator opens a channel, sends a message and closes the channel. It then opens a second channel, sends a message and waits for the receiver to close the channel.

It does this in a loop until an error occurs.
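In outline, the reproduction does this (simplified sketch assuming an async context; the real file also wires up the two peer connections and signaling):

```js
// Receiver: echo each message back, then close the channel.
receiver.ondatachannel = ({ channel }) => {
  channel.onmessage = e => {
    channel.send(e.data);
    channel.close();
  };
};

// Initiator: open/send/close a first channel, then open a second, send,
// and wait for the receiver to close it before looping.
async function iteration(n) {
  const first = initiator.createDataChannel(`iter-${n}-first`);
  await new Promise(r => first.addEventListener("open", r, { once: true }));
  first.onmessage = e => console.log("first got:", e.data);
  first.send(`message for first (${n})`);
  first.close(); // leaving this out avoids the cross-delivery, but the loop
                 // eventually stalls (see below)

  const second = initiator.createDataChannel(`iter-${n}-second`);
  await new Promise(r => second.addEventListener("open", r, { once: true }));
  // Cross-delivery shows up here: data meant for one channel arrives on another.
  second.onmessage = e => console.log("second got:", e.data);
  second.send(`message for second (${n})`);
  await new Promise(r => second.addEventListener("close", r, { once: true }));
}

for (let n = 0; ; n++) {
  await iteration(n); // loops until an error occurs
}
```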

In Firefox Nightly 144.0a1 this runs a couple of iterations, then the second datachannel receives a message that was sent to the first datachannel.

This appears to be because both the initiator and the receiver close the datachannel. If line 154 of the attachment is commented out (i.e. the initiator does not close the channel), the messages always arrive at the correct datachannel, but the loop eventually grinds to a halt after 10-40k iterations, when it should run forever given that both datachannels are closed before the loop continues.

I've added a reproduction file. It doesn't show this problem directly (i.e. it doesn't throw a "The operation failed for an operation-specific reason" error), but it does show that sometimes datachannels will receive messages sent to other datachannels if previously the same datachannel was closed by both ends of the connection. Should I open a new bug for this?

> if previously the same datachannel was closed by both ends of the connection

I mean if a previous datachannel was closed by both ends of the connection. Sorry for the confusion, I can't edit my comments here.

I've opened https://bugzilla.mozilla.org/show_bug.cgi?id=1988454 as I'm not sure these two problems are related.

The "operation-specific reason" error only seems to happen when the channel IDs are the same, the "wrong channel delivery" can happen with different IDs and is probably serious enough to warrant tracking it separately.

See Also: → 1988454

Byron, could you take a look? (both here and bug 1988454 ideally)

Flags: needinfo?(docfaraday)

I think I see what's happening here, but there's a spec wrinkle: selecting an already-in-use id when calling createDataChannel does not cause an error according to the spec, even though it cannot work and is clearly invalid. That's really strange, and it means we can't test this fully in wpt right now. I think I can at least test that this specific bug does not occur, though.
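For illustration, something like this is allowed without an error as far as I can tell from the spec (sketch; the names and the negotiated-channel approach are just for demonstration, and actual browser behavior may differ):

```js
const pc = new RTCPeerConnection();
const a = pc.createDataChannel("a", { negotiated: true, id: 1 });
// Per my reading of the spec, this does not throw even though id 1 is
// already in use, despite the fact that the second channel can never work.
const b = pc.createDataChannel("b", { negotiated: true, id: 1 });
console.log(a.id, b.id); // both claim id 1
```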

Assignee: nobody → docfaraday
Flags: needinfo?(docfaraday)

I was hoping that I could get a try run today, but there was some infra bustage that ate my pushes. Trying again...

https://treeherder.mozilla.org/jobs?repo=try&landoCommitID=153845
https://treeherder.mozilla.org/jobs?repo=try&landoCommitID=153846

The severity field is not set for this bug.
:mjf, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(mfroman)

Setting to S2 for now.

Severity: -- → S2
Flags: needinfo?(mfroman)

Also, make sure that we don't fire close events until streams have been reset
in both directions.

Depends on D269062

Depends on D269063

Add some missing test case cleanup, mark a test as long, use promise_test instead of async_test in one place.

Depends on D269064

Depends on D269065

This helps ensure that these runnables (and all of their lambda captures)
aren't leaked during shutdown.

Depends on D269066

Mostly this is logging crucial lifecycle events at INFO, not DEBUG.

Depends on D269067

Depends on D269069

Pushed by bcampen@mozilla.com:
https://github.com/mozilla-firefox/firefox/commit/03725a4aff91
https://hg.mozilla.org/integration/autoland/rev/2c7b813bc96c
Test that ids are reusable as soon as the close event fires. r=jib
https://github.com/mozilla-firefox/firefox/commit/e2d169db608a
https://hg.mozilla.org/integration/autoland/rev/eac802dc63b4
Track whether stream ids are in use on a per-direction basis. r=ng
https://github.com/mozilla-firefox/firefox/commit/b913a5c92a97
https://hg.mozilla.org/integration/autoland/rev/bfacca5f1eec
Use labels in these DataChannel tests. r=jib
https://github.com/mozilla-firefox/firefox/commit/c8fc75101bdc
https://hg.mozilla.org/integration/autoland/rev/e686cc99b7e5
Miscellaneous test cleanup. r=jib
https://github.com/mozilla-firefox/firefox/commit/f6fd8610b5e3
https://hg.mozilla.org/integration/autoland/rev/738a2847ad05
Make ResetStreams fallible. r=ng
https://github.com/mozilla-firefox/firefox/commit/e0e338611430
https://hg.mozilla.org/integration/autoland/rev/ce0c3db06950
Use cancelable runnables, and fallible dispatch. r=ng
https://github.com/mozilla-firefox/firefox/commit/8ca8d1d12979
https://hg.mozilla.org/integration/autoland/rev/d48fbf91af28
Logging improvements. r=ng
https://github.com/mozilla-firefox/firefox/commit/4577d758859d
https://hg.mozilla.org/integration/autoland/rev/eaae699d5b93
Reduce the number of addrefs/releases to simplify leak debugging. r=ng
https://github.com/mozilla-firefox/firefox/commit/ce76403552ad
https://hg.mozilla.org/integration/autoland/rev/87544e60cda8
Make sure these are only run once. r=ng
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/55745 for changes under testing/web-platform/tests
Upstream PR merged by moz-wptsync-bot
Regressions: 1997294
QA Whiteboard: [qa-triage-done-c147/b146]