Datachannel hangs on close - race condition causing iloop with a lock held.
Categories
(Core :: WebRTC: Networking, defect, P2)
Tracking
()
People
(Reporter: drno, Assigned: jesup)
References
(Regression)
Details
(Keywords: regression)
Attachments
(2 files)
2.27 MB,
application/x-gzip
|
Details | |
48 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-esr91+
|
Details | Review |
I'm able to reproduce content process hangs (100% CPU usage) when keeping https://test-network.callstats.io/ open for a few minutes.
Here is a Firefox profiler run for it https://share.firefox.dev/3BJLwLq
This reproduces for me on Nightly 95 and Release 93. Both running on macOS 11.6.
I was able to capture logs with DataChannel:5 for it. Unfortunately the logs get truncated. And as soon as I add 'sync' to the MOZ_LOG setting the bug no longer reproduces.
Comment 1•3 years ago
|
||
So, we're deadlocking because we're going reentrant while firing a "closed" event. Looking at webrtc-pc, I actually don't see any "queue a task" step that would protect from this. We might be able to fix the deadlock fairly easily, but other reentrancy bugs may remain.
jib, is there something in the spec I missed that would protect from this if we implemented it?
Assignee | ||
Comment 2•3 years ago
|
||
[Tracking Requested - why for this release]: Can cause any page using datachannels to lock up, chewing a core.
We're not deadlocked, we're i-looping with a lock held.
It's i-looping with the lock held on Socket Thread. It's a regression from bug 1556795 - a rework of the channel allocation logic. We're in a threshold event, and call SendDeferredMessages(). We have the lock. We call Channels::Get(0) (i = 0, end = 1). We don't get a channel (returns nullptr), so we call UpdateCurrentStreamIndex(), which calls Channels::GetNextChannel(), but the channel array is empty, so we return nullptr, and we return 0 and loop. So i never gets above 0, and we don't find a channel 0, so we never stop looping.
Probably it's a race condition with a close that modified the channel array. Since end=1, GetCurrentStreamIndex() must have returned 1, and i was 1 for the first iteration.
There might still be a spec issue, but that would be a separate bug. We do dispatch an event for the close stuff.
Updated•3 years ago
|
Comment 3•3 years ago
|
||
Set release status flags based on info from the regressing bug 1556795
Comment 4•3 years ago
|
||
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #2)
[Tracking Requested - why for this release]: Can cause any page using datachannels to lock up, chewing a core.
We're not deadlocked, we're i-looping with a lock held.
It's i-looping with the lock held on Socket Thread. It's a regression from bug 1556795 - a rework of the channel allocation logic. We're in a threshold event, and call SendDeferredMessages(). We have the lock. We call Channels::Get(0) (i = 0, end = 1). We don't get a channel (returns nullptr), so we call UpdateCurrentStreamIndex(), which calls Channels::GetNextChannel(), but the channel array is empty, so we return nullptr, and we return 0 and loop. So i never gets above 0, and we don't find a channel 0, so we never stop looping.
Probably it's a race condition with a close that modified the channel array. Since end=1, GetCurrentStreamIndex() must have returned 1, and i was 1 for the first iteration.
There might still be a spec issue, but that would be a separate bug. We do dispatch an event for the close stuff.
Hmm, the profile did not have the socket thread, so did not see this happening. We're still going reentrant on main, which is a huge problem I think. I'll look into this.
Assignee | ||
Comment 5•3 years ago
|
||
Depends on D128738
Comment 6•3 years ago
|
||
Oh, you were working a patch for this already.
Comment 7•3 years ago
|
||
(In reply to Byron Campen [:bwc] from comment #1)
So, we're deadlocking because we're going reentrant while firing a "closed" event. Looking at webrtc-pc, I actually don't see any "queue a task" step that would protect from this. ... jib, is there something in the spec I missed that would protect from this if we implemented it?
If you mean the "close"
event, the spec says: "the user agent MUST queue a task to run the following steps: ... Fire an event named close
at channel.".
Updated•3 years ago
|
Updated•3 years ago
|
Comment 9•3 years ago
|
||
bugherder |
Updated•3 years ago
|
Comment 10•3 years ago
|
||
Is this something you wanted to nominate for ESR91 approval?
Assignee | ||
Comment 11•3 years ago
|
||
Comment on attachment 9246425 [details]
Bug 1735972: reset current channel if that channel is closed r=bwc
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration: Can cause a core to get chewed forever
- User impact if declined: A datachannel may lock up randomly, chewing a core, slowing down the machine and wasting battery if mobile/laptop.
- Fix Landed on Version: 95
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Extremely simple patch. Just resets the current stream value if it's been deleted
- String or UUID changes made by this patch: None
Comment 12•3 years ago
|
||
Comment on attachment 9246425 [details]
Bug 1735972: reset current channel if that channel is closed r=bwc
Approved for 91.4esr.
Comment 13•3 years ago
|
||
bugherder uplift |
Description
•