Open Bug 1646716 Opened 4 years ago Updated 2 years ago

usrsctp_finish is racy

Categories

(Core :: WebRTC: Networking, defect, P3)

defect

Tracking

()

People

(Reporter: bwc, Unassigned)

References

Details

Observed this failure on TSan when trying to call usrsctp_finish after we shut down and deregister all of our DataChannelConnections:

[task 2020-06-18T01:42:48.511Z] 01:42:48 INFO - GECKO(1265) | WARNING: ThreadSanitizer: data race (pid=1405)
[task 2020-06-18T01:42:48.512Z] 01:42:48 INFO - GECKO(1265) | Write of size 1 at 0x7f24a2682210 by main thread (mutexes: write M1124909298406113120):
[task 2020-06-18T01:42:48.512Z] 01:42:48 INFO - GECKO(1265) | #0 sctp_finish /builds/worker/checkouts/gecko/netwerk/sctp/src/netinet/sctp_usrreq.c:236:42 (libxul.so+0x1324d16)
[task 2020-06-18T01:42:48.512Z] 01:42:48 INFO - GECKO(1265) | #1 usrsctp_finish /builds/worker/checkouts/gecko/netwerk/sctp/src/user_socket.c:2236:2 (libxul.so+0x1344f82)
[task 2020-06-18T01:42:48.514Z] 01:42:48 INFO - GECKO(1265) | #2 DeinitUsrSctp /builds/worker/checkouts/gecko/netwerk/sctp/datachannel/DataChannel.cpp:251:5 (libxul.so+0x135c07b)
[task 2020-06-18T01:42:48.515Z] 01:42:48 INFO - GECKO(1265) | #3 mozilla::DataChannelRegistry::DeregisterImpl(unsigned long) /builds/worker/checkouts/gecko/netwerk/sctp/datachannel/DataChannel.cpp:191:7 (libxul.so+0x135c07b)
[task 2020-06-18T01:42:48.515Z] 01:42:48 INFO - GECKO(1265) | #4 mozilla::DataChannelRegistry::Deregister(unsigned long) /builds/worker/checkouts/gecko/netwerk/sctp/datachannel/DataChannel.cpp:114:17 (libxul.so+0x135bc93)
[task 2020-06-18T01:42:48.516Z] 01:42:48 INFO - GECKO(1265) | #5 operator() /builds/worker/checkouts/gecko/netwerk/sctp/datachannel/DataChannel.cpp:415:22 (libxul.so+0x135bb5f)
[task 2020-06-18T01:42:48.517Z] 01:42:48 INFO - GECKO(1265) | #6 mozilla::detail::RunnableFunction<mozilla::DataChannelConnection::DestroyOnSTS(socket*, socket*)::$_0>::Run() /builds/worker/workspace/obj-build/dist/include/nsThreadUtils.h:574:5 (libxul.so+0x135bb5f)
[task 2020-06-18T01:42:48.518Z] 01:42:48 INFO - GECKO(1265) | #7 nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1234:14 (libxul.so+0xb13602)
[task 2020-06-18T01:42:48.519Z] 01:42:48 INFO - GECKO(1265) | #8 NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:501:10 (libxul.so+0xb188f2)
[task 2020-06-18T01:42:48.520Z] 01:42:48 INFO - GECKO(1265) | #9 mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:87:21 (libxul.so+0x13e6bd9)
[task 2020-06-18T01:42:48.521Z] 01:42:48 INFO - GECKO(1265) | #10 mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:270:30 (libxul.so+0x13e75eb)
[task 2020-06-18T01:42:48.523Z] 01:42:48 INFO - GECKO(1265) | #11 RunInternal /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:315:10 (libxul.so+0x1372abc)
[task 2020-06-18T01:42:48.524Z] 01:42:48 INFO - GECKO(1265) | #12 RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:308:3 (libxul.so+0x1372abc)
[task 2020-06-18T01:42:48.525Z] 01:42:48 INFO - GECKO(1265) | #13 MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:290:3 (libxul.so+0x1372abc)
[task 2020-06-18T01:42:48.529Z] 01:42:48 INFO - GECKO(1265) | #14 nsBaseAppShell::Run() /builds/worker/checkouts/gecko/widget/nsBaseAppShell.cpp:137:27 (libxul.so+0x4874833)
[task 2020-06-18T01:42:48.534Z] 01:42:48 INFO - GECKO(1265) | #15 XRE_RunAppShell() /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:913:20 (libxul.so+0x647ac29)
[task 2020-06-18T01:42:48.536Z] 01:42:48 INFO - GECKO(1265) | #16 mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:237:9 (libxul.so+0x13e759a)
[task 2020-06-18T01:42:48.537Z] 01:42:48 INFO - GECKO(1265) | #17 RunInternal /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:315:10 (libxul.so+0x1372abc)
[task 2020-06-18T01:42:48.538Z] 01:42:48 INFO - GECKO(1265) | #18 RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:308:3 (libxul.so+0x1372abc)
[task 2020-06-18T01:42:48.538Z] 01:42:48 INFO - GECKO(1265) | #19 MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:290:3 (libxul.so+0x1372abc)
[task 2020-06-18T01:42:48.539Z] 01:42:48 INFO - GECKO(1265) | #20 XRE_InitChildProcess(int, char**, XREChildData const*) /builds/worker/checkouts/gecko/toolkit/xre/nsEmbedFunctions.cpp:744:34 (libxul.so+0x647a9a9)
[task 2020-06-18T01:42:48.539Z] 01:42:48 INFO - GECKO(1265) | #21 mozilla::BootstrapImpl::XRE_InitChildProcess(int, char**, XREChildData const*) /builds/worker/checkouts/gecko/toolkit/xre/Bootstrap.cpp:67:12 (libxul.so+0x6482aa2)
[task 2020-06-18T01:42:48.540Z] 01:42:48 INFO - GECKO(1265) | #22 content_process_main /builds/worker/checkouts/gecko/browser/app/../../ipc/contentproc/plugin-container.cpp:56:28 (firefox+0xca0d7)
[task 2020-06-18T01:42:48.540Z] 01:42:48 INFO - GECKO(1265) | #23 main /builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp:303:18 (firefox+0xca0d7)
[task 2020-06-18T01:42:48.541Z] 01:42:48 INFO - GECKO(1265) | Previous read of size 1 at 0x7f24a2682210 by thread T32:
[task 2020-06-18T01:42:48.541Z] 01:42:48 INFO - GECKO(1265) | #0 user_sctp_timer_iterate /builds/worker/checkouts/gecko/netwerk/sctp/src/netinet/sctp_callout.c:293:7 (libxul.so+0x12c567c)
[task 2020-06-18T01:42:48.545Z] 01:42:48 INFO - GECKO(1265) | Location is global 'system_base_info' of size 1824 at 0x7f24a2681b10 (libxul.so+0x00000b332210)
[task 2020-06-18T01:42:48.545Z] 01:42:48 INFO - GECKO(1265) | Mutex M1124909298406113120 is already destroyed.

Any ideas? Is there something we should be doing before calling usrsctp_finish?

Flags: needinfo?(tuexen)

No, close all sockets is what you have to do. Can you tell me, what it complaint is concrete? Not sure which code you are using...

Flags: needinfo?(tuexen)

The TSan output in the description corresponds to a case where we call usrsctp_finalize after the last socket is closed, and the last address is deregistered. It seems that there is a timer thread still iterating at this point. Here's the racy bit from the thread calling usrsctp_finish:

https://searchfox.org/mozilla-central/rev/eef39502e08bcd3c40573c65a6827828dce4a032/netwerk/sctp/src/netinet/sctp_usrreq.c#236

And here's the racy bit from the timer thread:

https://searchfox.org/mozilla-central/rev/eef39502e08bcd3c40573c65a6827828dce4a032/netwerk/sctp/src/netinet/sctp_callout.c#293

It looks like this is one of those cases where you have a bool guard that needs to be atomic to keep TSan happy.

You could use the c11 atomics library, if you're willing to require support for c11.

See Also: → 1645219

But what is the problem? It is waited until the timer thread exited, it might take a bit longer, but it will work. Am I missing something? Can you explain under which conditions something bad would happen?

The race might be benign, but TSan cannot determine that. Fixing benign races is part of the price of using TSan, which is a price worth paying. It's a little like compiler warnings. Sometimes they end up being safe, but it is better to fix them if possible so you can catch real problems.

Severity: -- → S3
Priority: -- → P3
See Also: → 1646134

(In reply to Byron Campen [:bwc] from comment #6)

The race might be benign, but TSan cannot determine that. Fixing benign races is part of the price of using TSan, which is a price worth paying. It's a little like compiler warnings. Sometimes they end up being safe, but it is better to fix them if possible so you can catch real problems.

Understood. But I need to understand what the warning is about to fix or silence it. Here I'm not sure what TSan is complaining about. I do see Taylors issue, think it is harmless, but don't know how to silence it, since reads are atomic.

But for this issue, I need to know what TSan is complaining about...

(In reply to Michael Tüxen from comment #7)

(In reply to Byron Campen [:bwc] from comment #6)

The race might be benign, but TSan cannot determine that. Fixing benign races is part of the price of using TSan, which is a price worth paying. It's a little like compiler warnings. Sometimes they end up being safe, but it is better to fix them if possible so you can catch real problems.

Understood. But I need to understand what the warning is about to fix or silence it. Here I'm not sure what TSan is complaining about. I do see Taylors issue, think it is harmless, but don't know how to silence it, since reads are atomic.

But for this issue, I need to know what TSan is complaining about...

So reads are essentially never really guaranteed to be atomic unless you use atomic primitives. (see https://software.intel.com/content/www/us/en/develop/blogs/benign-data-races-what-could-possibly-go-wrong.html)

Blocks: 1724420
You need to log in before you can comment on or make changes to this bug.