Open Bug 1728616 Opened 5 months ago Updated 3 months ago

Intermittent deadlock in webrtc/RTCDataChannel-close.html breaking tsan?

Categories

(Core :: WebRTC: Networking, defect, P2)

People

(Reporter: bwc, Unassigned)

This is different from what I observed while working on bug 1635911:

[task 2021-08-31T23:32:14.266Z] 23:32:14 INFO - PID 1268 | [Child 1575: Main Thread]: D/DataChannel 7b2c00021160: Close()ing 7b3400034750
[task 2021-08-31T23:33:13.494Z] 23:33:13 INFO - PID 1268 | [Child 1575: Unnamed thread 7b440003bd80]: D/DataChannel In receive_cb, ulp_info=41
[task 2021-08-31T23:33:13.495Z] 23:33:13 INFO - PID 1268 | [Child 1575: Unnamed thread 7b440003bd80]: D/DataChannel In ReceiveCallback
[task 2021-08-31T23:35:21.451Z] 23:35:21 INFO - Got timeout in harness
[task 2021-08-31T23:35:21.454Z] 23:35:21 INFO - TEST-UNEXPECTED-TIMEOUT | /webrtc/RTCDataChannel-close.html | TestRunner hit external timeout (this may indicate a hang)
[task 2021-08-31T23:35:21.454Z] 23:35:21 INFO - TEST-INFO took 195004ms

That last log line is here:

https://searchfox.org/mozilla-central/rev/ac7da6c7306d86e2f86a302ce1e170ad54b3c1fe/netwerk/sctp/datachannel/DataChannel.cpp#2372

We do not see the following logging, which means we're in the case where !!data:

https://searchfox.org/mozilla-central/rev/ac7da6c7306d86e2f86a302ce1e170ad54b3c1fe/netwerk/sctp/datachannel/DataChannel.cpp#2375

From the logging, we are on an unnamed thread (in other words, we're getting a callback from libusrsctp), so we'll end up trying to lock here:

https://searchfox.org/mozilla-central/rev/ac7da6c7306d86e2f86a302ce1e170ad54b3c1fe/netwerk/sctp/datachannel/DataChannel.cpp#2379

The log line before that is the "Close()ing" line from the main thread (about 59 seconds earlier); Close() ends up locking the same mutex here:

https://searchfox.org/mozilla-central/source/netwerk/sctp/datachannel/DataChannel.cpp#2989

It looks like there might be cases where we call into libusrsctp while holding that lock. That could cause a lock-order inversion against the callback path and deadlock the main thread, which would explain why we stop seeing logging for the entire process. This is just a hypothesis, though.
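
To make the hypothesis concrete, here is a minimal standalone sketch of the suspected pattern. It is not the DataChannel code: `channelLock`, `libLock`, and `InversionDetected` are hypothetical names, and `libLock` only assumes that libusrsctp holds some internal lock while it runs our callback. Timed mutexes are used so the inversion shows up as a timeout instead of a real hang.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-ins: channelLock models the DataChannel mutex, and
// libLock models a lock assumed to be held inside the SCTP library while
// it delivers a receive callback.
struct Locks {
  std::timed_mutex channelLock;
  std::timed_mutex libLock;
};

// Returns true if either thread timed out waiting for its second lock,
// which is how this sketch reports that the lock-order inversion was hit.
bool InversionDetected(bool releaseBeforeLibraryCall) {
  using namespace std::chrono_literals;
  Locks locks;
  std::atomic<int> arrived{0};
  std::atomic<bool> timedOut{false};

  // Make sure both threads hold their first lock before taking the second.
  auto rendezvous = [&] {
    arrived.fetch_add(1);
    while (arrived.load() < 2) std::this_thread::yield();
  };

  // Models the main thread: takes the channel lock, then calls into the library.
  std::thread mainLike([&] {
    std::unique_lock<std::timed_mutex> channel(locks.channelLock);
    rendezvous();
    if (releaseBeforeLibraryCall) channel.unlock();  // possible mitigation
    std::unique_lock<std::timed_mutex> lib(locks.libLock, 300ms);
    if (!lib.owns_lock()) timedOut = true;
  });

  // Models the library thread: holds its own lock, then our callback
  // tries to take the channel lock.
  std::thread callbackLike([&] {
    std::unique_lock<std::timed_mutex> lib(locks.libLock);
    rendezvous();
    std::unique_lock<std::timed_mutex> channel(locks.channelLock, 300ms);
    if (!channel.owns_lock()) timedOut = true;
  });

  mainLike.join();
  callbackLike.join();
  return timedOut.load();
}
```

Calling `InversionDetected(false)` takes the two locks in opposite orders on the two threads and reports a timeout; `InversionDetected(true)` drops the channel lock before the simulated library call and completes. Whether releasing the lock before calling into libusrsctp is safe in the real code is a separate question that this sketch does not answer.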

Severity: S3 → S2
Priority: P3 → P2

Maybe related to bug 1735972?

See Also: → 1735972