usrsctp_finish is racy
Categories
(Core :: WebRTC: Networking, defect, P3)
Tracking
()
People
(Reporter: bwc, Unassigned)
References
Details
Observed this failure on TSan when trying to call usrsctp_finish after we shut down and deregister all of our DataChannelConnections:
[task 2020-06-18T01:42:48.511Z] 01:42:48 INFO - GECKO(1265) | WARNING: ThreadSanitizer: data race (pid=1405)
[task 2020-06-18T01:42:48.512Z] 01:42:48 INFO - GECKO(1265) | Write of size 1 at 0x7f24a2682210 by main thread (mutexes: write M1124909298406113120):
[task 2020-06-18T01:42:48.512Z] 01:42:48 INFO - GECKO(1265) | #0 sctp_finish /builds/worker/checkouts/gecko/netwerk/sctp/src/netinet/sctp_usrreq.c:236:42 (libxul.so+0x1324d16)
[task 2020-06-18T01:42:48.512Z] 01:42:48 INFO - GECKO(1265) | #1 usrsctp_finish /builds/worker/checkouts/gecko/netwerk/sctp/src/user_socket.c:2236:2 (libxul.so+0x1344f82)
[task 2020-06-18T01:42:48.514Z] 01:42:48 INFO - GECKO(1265) | #2 DeinitUsrSctp /builds/worker/checkouts/gecko/netwerk/sctp/datachannel/DataChannel.cpp:251:5 (libxul.so+0x135c07b)
[task 2020-06-18T01:42:48.515Z] 01:42:48 INFO - GECKO(1265) | #3 mozilla::DataChannelRegistry::DeregisterImpl(unsigned long) /builds/worker/checkouts/gecko/netwerk/sctp/datachannel/DataChannel.cpp:191:7 (libxul.so+0x135c07b)
[task 2020-06-18T01:42:48.515Z] 01:42:48 INFO - GECKO(1265) | #4 mozilla::DataChannelRegistry::Deregister(unsigned long) /builds/worker/checkouts/gecko/netwerk/sctp/datachannel/DataChannel.cpp:114:17 (libxul.so+0x135bc93)
[task 2020-06-18T01:42:48.516Z] 01:42:48 INFO - GECKO(1265) | #5 operator() /builds/worker/checkouts/gecko/netwerk/sctp/datachannel/DataChannel.cpp:415:22 (libxul.so+0x135bb5f)
[task 2020-06-18T01:42:48.517Z] 01:42:48 INFO - GECKO(1265) | #6 mozilla::detail::RunnableFunction<mozilla::DataChannelConnection::DestroyOnSTS(socket*, socket*)::$_0>::Run() /builds/worker/workspace/obj-build/dist/include/nsThreadUtils.h:574:5 (libxul.so+0x135bb5f)
[task 2020-06-18T01:42:48.518Z] 01:42:48 INFO - GECKO(1265) | #7 nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1234:14 (libxul.so+0xb13602)
[task 2020-06-18T01:42:48.519Z] 01:42:48 INFO - GECKO(1265) | #8 NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:501:10 (libxul.so+0xb188f2)
[task 2020-06-18T01:42:48.520Z] 01:42:48 INFO - GECKO(1265) | #9 mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:87:21 (libxul.so+0x13e6bd9)
[task 2020-06-18T01:42:48.521Z] 01:42:48 INFO - GECKO(1265) | #10 mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:270:30 (libxul.so+0x13e75eb)
[task 2020-06-18T01:42:48.523Z] 01:42:48 INFO - GECKO(1265) | #11 RunInternal /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:315:10 (libxul.so+0x1372abc)
[task 2020-06-18T01:42:48.524Z] 01:42:48 INFO - GECKO(1265) | #12 RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:308:3 (libxul.so+0x1372abc)
[task 2020-06-18T01:42:48.525Z] 01:42:48 INFO - GECKO(1265) | #13 MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:290:3 (libxul.so+0x1372abc)
[task 2020-06-18T01:42:48.529Z] 01:42:48 INFO - GECKO(1265) | #14 nsBaseAppShell::Run() /builds/worker/checkouts/gecko/widget/nsBaseAppShell.cpp:137:27 (libxul.so+0x4874833)
[task 2020-06-18T01:42:48.534Z] 01:42:48 INFO - GECKO(1265) | #15 XRE_RunAppShell() /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:913:20 (libxul.so+0x647ac29)
[task 2020-06-18T01:42:48.536Z] 01:42:48 INFO - GECKO(1265) | #16 mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:237:9 (libxul.so+0x13e759a)
[task 2020-06-18T01:42:48.537Z] 01:42:48 INFO - GECKO(1265) | #17 RunInternal /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:315:10 (libxul.so+0x1372abc)
[task 2020-06-18T01:42:48.538Z] 01:42:48 INFO - GECKO(1265) | #18 RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:308:3 (libxul.so+0x1372abc)
[task 2020-06-18T01:42:48.538Z] 01:42:48 INFO - GECKO(1265) | #19 MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:290:3 (libxul.so+0x1372abc)
[task 2020-06-18T01:42:48.539Z] 01:42:48 INFO - GECKO(1265) | #20 XRE_InitChildProcess(int, char**, XREChildData const*) /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:744:34 (libxul.so+0x647a9a9)
[task 2020-06-18T01:42:48.539Z] 01:42:48 INFO - GECKO(1265) | #21 mozilla::BootstrapImpl::XRE_InitChildProcess(int, char**, XREChildData const*) /builds/worker/checkouts/gecko/toolkit/xre/Bootstrap.cpp:67:12 (libxul.so+0x6482aa2)
[task 2020-06-18T01:42:48.540Z] 01:42:48 INFO - GECKO(1265) | #22 content_process_main /builds/worker/checkouts/gecko/browser/app/../../ipc/contentproc/plugin-container.cpp:56:28 (firefox+0xca0d7)
[task 2020-06-18T01:42:48.540Z] 01:42:48 INFO - GECKO(1265) | #23 main /builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp:303:18 (firefox+0xca0d7)
[task 2020-06-18T01:42:48.541Z] 01:42:48 INFO - GECKO(1265) | Previous read of size 1 at 0x7f24a2682210 by thread T32:
[task 2020-06-18T01:42:48.541Z] 01:42:48 INFO - GECKO(1265) | #0 user_sctp_timer_iterate /builds/worker/checkouts/gecko/netwerk/sctp/src/netinet/sctp_callout.c:293:7 (libxul.so+0x12c567c)
[task 2020-06-18T01:42:48.545Z] 01:42:48 INFO - GECKO(1265) | Location is global 'system_base_info' of size 1824 at 0x7f24a2681b10 (libxul.so+0x00000b332210)
[task 2020-06-18T01:42:48.545Z] 01:42:48 INFO - GECKO(1265) | Mutex M1124909298406113120 is already destroyed.
Reporter | ||
Comment 1•4 years ago
|
||
Any ideas? Is there something we should be doing before calling usrsctp_finish?
Comment 2•4 years ago
|
||
No, close all sockets is what you have to do. Can you tell me, what it complaint is concrete? Not sure which code you are using...
Reporter | ||
Comment 3•4 years ago
|
||
The TSan output in the description corresponds to a case where we call usrsctp_finalize after the last socket is closed, and the last address is deregistered. It seems that there is a timer thread still iterating at this point. Here's the racy bit from the thread calling usrsctp_finish:
And here's the racy bit from the timer thread:
It looks like this is one of those cases where you have a bool guard that needs to be atomic to keep TSan happy.
Reporter | ||
Comment 4•4 years ago
|
||
You could use the c11 atomics library, if you're willing to require support for c11.
Comment 5•4 years ago
|
||
But what is the problem? It is waited until the timer thread exited, it might take a bit longer, but it will work. Am I missing something? Can you explain under which conditions something bad would happen?
Reporter | ||
Comment 6•4 years ago
•
|
||
The race might be benign, but TSan cannot determine that. Fixing benign races is part of the price of using TSan, which is a price worth paying. It's a little like compiler warnings. Sometimes they end up being safe, but it is better to fix them if possible so you can catch real problems.
Updated•4 years ago
|
Comment 7•4 years ago
|
||
(In reply to Byron Campen [:bwc] from comment #6)
The race might be benign, but TSan cannot determine that. Fixing benign races is part of the price of using TSan, which is a price worth paying. It's a little like compiler warnings. Sometimes they end up being safe, but it is better to fix them if possible so you can catch real problems.
Understood. But I need to understand what the warning is about to fix or silence it. Here I'm not sure what TSan is complaining about. I do see Taylors issue, think it is harmless, but don't know how to silence it, since reads are atomic.
But for this issue, I need to know what TSan is complaining about...
Reporter | ||
Comment 8•4 years ago
|
||
(In reply to Michael Tüxen from comment #7)
(In reply to Byron Campen [:bwc] from comment #6)
The race might be benign, but TSan cannot determine that. Fixing benign races is part of the price of using TSan, which is a price worth paying. It's a little like compiler warnings. Sometimes they end up being safe, but it is better to fix them if possible so you can catch real problems.
Understood. But I need to understand what the warning is about to fix or silence it. Here I'm not sure what TSan is complaining about. I do see Taylors issue, think it is harmless, but don't know how to silence it, since reads are atomic.
But for this issue, I need to know what TSan is complaining about...
So reads are essentially never really guaranteed to be atomic unless you use atomic primitives. (see https://software.intel.com/content/www/us/en/develop/blogs/benign-data-races-what-could-possibly-go-wrong.html)
Description
•