Closed Bug 1735972 Opened 2 months ago Closed 1 month ago

Datachannel hangs on close - race condition causing iloop with a lock held.

Categories

(Core :: WebRTC: Networking, defect, P2)

defect

Tracking

()

RESOLVED FIXED
95 Branch
Tracking Status
firefox-esr78 --- wontfix
firefox-esr91 --- fixed
firefox93 --- wontfix
firefox94 --- wontfix
firefox95 + fixed

People

(Reporter: drno, Assigned: jesup)

References

(Regression)

Details

(Keywords: regression)

Attachments

(2 files)

I'm able to reproduce content process hangs (100% CPU usage) when keeping https://test-network.callstats.io/ open for a few minutes.

Here is a Firefox profiler run for it https://share.firefox.dev/3BJLwLq

This reproduces for me on Nightly 95 and Release 93. Both running on macOS 11.6.

I was able to capture logs with DataChannel:5 for it. Unfortunately the logs get truncated. And as soon as I add 'sync' to the MOZ_LOG setting the bug no longer reproduces.

So, we're deadlocking because we're going reentrant while firing a "closed" event. Looking at webrtc-pc, I actually don't see any "queue a task" step that would protect from this. We might be able to fix the deadlock fairly easily, but other reentrancy bugs may remain.

jib, is there something in the spec I missed that would protect from this if we implemented it?

Flags: needinfo?(jib)

[Tracking Requested - why for this release]: Can cause any page using datachannels to lock up, chewing a core.

We're not deadlocked, we're i-looping with a lock held.

It's i-looping with the lock held on Socket Thread. It's a regression from bug 1556795 - a rework of the channel allocation logic. We're in a threshold event, and call SendDeferredMessages(). We have the lock. We call Channels::Get(0) (i = 0, end = 1). We don't get a channel (returns nullptr), so we call UpdateCurrentStreamIndex(), which calls Channels::GetNextChannel(), but the channel array is empty, so we return nullptr, and we return 0 and loop. So i never gets above 0, and we don't find a channel 0, so we never stop looping.

Probably it's a race condition with a close that modified the channel array. Since end=1, GetCurrentStreamIndex() must have returned 1, and i was 1 for the first iteration.

There might still be a spec issue, but that would be a separate bug. We do dispatch an event for the close stuff.

Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(docfaraday)
Keywords: regression
Regressed by: 1556795
Summary: Datachannel hangs on close → Datachannel hangs on close - race condition causing iloop with a lock held.

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #2)

[Tracking Requested - why for this release]: Can cause any page using datachannels to lock up, chewing a core.

We're not deadlocked, we're i-looping with a lock held.

It's i-looping with the lock held on Socket Thread. It's a regression from bug 1556795 - a rework of the channel allocation logic. We're in a threshold event, and call SendDeferredMessages(). We have the lock. We call Channels::Get(0) (i = 0, end = 1). We don't get a channel (returns nullptr), so we call UpdateCurrentStreamIndex(), which calls Channels::GetNextChannel(), but the channel array is empty, so we return nullptr, and we return 0 and loop. So i never gets above 0, and we don't find a channel 0, so we never stop looping.

Probably it's a race condition with a close that modified the channel array. Since end=1, GetCurrentStreamIndex() must have returned 1, and i was 1 for the first iteration.

There might still be a spec issue, but that would be a separate bug. We do dispatch an event for the close stuff.

Hmm, the profile did not have the socket thread, so did not see this happening. We're still going reentrant on main, which is a huge problem I think. I'll look into this.

Assignee: nobody → docfaraday
Flags: needinfo?(docfaraday)

Oh, you were working a patch for this already.

Assignee: docfaraday → rjesup

(In reply to Byron Campen [:bwc] from comment #1)

So, we're deadlocking because we're going reentrant while firing a "closed" event. Looking at webrtc-pc, I actually don't see any "queue a task" step that would protect from this. ... jib, is there something in the spec I missed that would protect from this if we implemented it?

If you mean the "close" event, the spec says: "the user agent MUST queue a task to run the following steps: ... Fire an event named close at channel.".

Flags: needinfo?(jib)
Severity: -- → S2
Priority: -- → P2
See Also: → 1728616
Pushed by rjesup@wgate.com:
https://hg.mozilla.org/integration/autoland/rev/aadf149cc1c8
reset current channel if that channel is closed r=bwc
Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED
Target Milestone: --- → 95 Branch

Is this something you wanted to nominate for ESR91 approval?

Flags: needinfo?(rjesup)

Comment on attachment 9246425 [details]
Bug 1735972: reset current channel if that channel is closed r=bwc

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Can cause a core to get chewed forever
  • User impact if declined: A datachannel may lock up randomly, chewing a core, slowing down the machine and wasting battery if mobile/laptop.
  • Fix Landed on Version: 95
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Extremely simple patch. Just resets the current stream value if it's been deleted
  • String or UUID changes made by this patch: None
Flags: needinfo?(rjesup)
Attachment #9246425 - Flags: approval-mozilla-esr91?

Comment on attachment 9246425 [details]
Bug 1735972: reset current channel if that channel is closed r=bwc

Approved for 91.4esr.

Attachment #9246425 - Flags: approval-mozilla-esr91? → approval-mozilla-esr91+
You need to log in before you can comment on or make changes to this bug.