TSan: data race netwerk/sctp/src/netinet/sctputil.c sctp_timer_start (many races on sctp_timer fields)

NEW
Unassigned

Status

()

Core
WebRTC
P3
normal
Rank:
25
2 years ago
5 months ago

People

(Reporter: froydnj, Unassigned)

Tracking

(Blocks: 1 bug)

unspecified
x86_64
Linux
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [tsan])

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
Created attachment 8680182 [details]
TSan stack trace

The attached logfile shows a thread/data race detected by TSan (ThreadSanitizer).

I'm filing this in WebRTC rather than Networking or Audio/Video because it's my impression that the sctp stuff is used for WebRTC-y things.  Please correct me if I'm mistaken!

* Specific information about this bug

It looks like what's happening is that we have essentially statically allocated timers that are being reused.  So a given timer expires in sctp_timeout_handler, which reads fields of that timer, and then the timer is set on a different thread, which writes fields of that timer, which in turn makes TSan unhappy.

It looks like there's a mutex held during writing, so perhaps the argument is that unlocking the mutex makes those fields visible across threads, and the reads wouldn't happen simultaneously anyway?

jesup, do you know who the right person is to verify the veracity of that explanation?

* General information about TSan, data races, etc.

Typically, races reported by TSan are not false positives, but it is possible that the race is benign. Even in this case though, we should try to come up with a fix unless this would cause unacceptable performance issues. Also note that seemingly benign races can possibly be harmful (also depending on the compiler and the architecture) [1][2].

If the bug cannot be fixed, then this bug should be used to either make a compile-time annotation for blacklisting or add an entry to the runtime blacklist.

[1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[2] _How to miscompile programs with "benign" data races_: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Flags: needinfo?(rjesup)
(Reporter)

Comment 1

2 years ago
Looks like there are some other races on sctp_timers in the same vicinity too...
-> Michael...
TSAN is a harsh master.... And as this is C code, Atomic isn't available.
Flags: needinfo?(rjesup) → needinfo?(tuexen)

Updated

2 years ago
backlog: --- → webrtc/webaudio+
Rank: 25
Priority: -- → P2

Comment 3

2 years ago
There was a similar report for the usrsctp stack:
https://github.com/sctplab/usrsctp/pull/22
The following fix was committed:
https://github.com/sctplab/usrsctp/commit/6db971f6f1392a55f2d481c6fd1abc8602cc125a
Does this address the issue?
Flags: needinfo?(tuexen)
(In reply to Michael Tüxen from comment #3)
> There was a similar report for the usrsctp stack:
> https://github.com/sctplab/usrsctp/pull/22
> The following fix was committed:
> https://github.com/sctplab/usrsctp/commit/
> 6db971f6f1392a55f2d481c6fd1abc8602cc125a
> Does this address the issue?

No, that's something else entirely.  This is on sctputil.c:2272 in rev 9209, which is tmr->tcb = (void*)stcb;

> It looks like there's a mutex held during writing, so perhaps the argument is that unlocking the
> mutex makes those fields visible across threads, and the reads wouldn't happen simultaneously anyway?

Well, locking just the writes doesn't solve the TSAN problem; the compiler can reuse your storage as a temp/spill/etc.  Aldo in sctp_timer_start() there's a conditional lock; it doesn't assert lock held if sctb is null (and in timeout, it doesn't grab the lock if sctb is null; plus it accesses tmr->tcb before locking - especially since in timeout sctb = tmr->tcb to begin with -- that's the "previous read" in the TSAN trace).

Looks to me like it was written under the sctb lock (which is for the sctb itself I think) because sctb was non-null; and the previous read was not under lock (and couldn't be under an sctb lock).

I suspect some re-think of the time locking strategies is needed.

Comment 5

2 years ago
I saw the mismatch of the line number, but wasn't sure on which source code base the tests were performed. What is specific to the userland stack is the timer queue which is protected by a lock. I know that part of the code Randy wrote doesn't always hold locks if he thought that it doesn't matter. I'll try to understand what the output means and what the problem would be. Is this TSAN testing done with valgrind? Then I can try to reproduce issues in simple test programs...
(In reply to Michael Tüxen from comment #5)
> I saw the mismatch of the line number, but wasn't sure on which source code
> base the tests were performed. What is specific to the userland stack is the
> timer queue which is protected by a lock. I know that part of the code Randy
> wrote doesn't always hold locks if he thought that it doesn't matter. I'll
> try to understand what the output means and what the problem would be. Is
> this TSAN testing done with valgrind? Then I can try to reproduce issues in
> simple test programs...

https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/src/netinet/sctputil.c#2272

Due to TSAN, locks rarely don't matter when multiple threads can access (unless atomic, or unless access is moderated by a message-pass).  TSAN sucks, but that's due to c/c++ being defined in the world of a single-threaded virtual machine assumption.

This is likely running the full browser using a TSAN-enabled build of mozilla-central (build with clang and TSAN: https://developer.mozilla.org/en-US/docs/Thread_Sanitizer).  TSAN is a runtime check, so with a TSAN build run some datachannel tests/applications and see what happens to get hit.  Some will get hit constantly, others will be hit only in certain races.
Mass change P2->P3 to align with new Mozilla triage process.
Priority: P2 → P3
You need to log in before you can comment on or make changes to this bug.