ThreadSanitizer: data race [@ nsTimerImpl::Shutdown] vs. [@ nsTimerImpl::CancelImpl]
Categories
(Core :: XPCOM, defect)
Tracking
firefox91: fixed
People
(Reporter: Gankra, Assigned: KrisWright)
References
(Blocks 1 open bug)
Details
Attachments
(4 files)
Found while testing out turning on the full shutdown code for TSan (the path that doesn't just exit). Presumably not a security issue, since this code is only for testing and doesn't run in production?
Looks like gThread needs to be atomically managed.
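To make the pattern concrete, here is a minimal sketch of the kind of race being reported, with simplified stand-in types (the bodies and method names are illustrative assumptions, not the actual XPCOM code):

```cpp
// Sketch only: TimerThread and the function bodies are simplified stand-ins.
struct TimerThread {
  void Shutdown() {}
  void CancelTimer() {}
};

static TimerThread* gThread = nullptr;  // shared global, no synchronization

// Runs during XPCOM shutdown (cf. nsTimerImpl::Shutdown).
void TimerShutdown() {
  if (gThread) {
    gThread->Shutdown();
    delete gThread;
    gThread = nullptr;  // unsynchronized write...
  }
}

// May run concurrently on another thread (cf. nsTimerImpl::CancelImpl).
void CancelImpl() {
  if (gThread) {             // ...races with this unsynchronized read,
    gThread->CancelTimer();  // and can call into an already-freed object
  }
}
```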
Permalinks to problematic lines:
General information about TSan reports
Why fix races?
Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.
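As a generic illustration of the compiler-optimization point (not code from this bug): a compiler may hoist the load of a non-atomic flag out of a loop, because without synchronization it is allowed to assume no other thread writes the flag.

```cpp
#include <thread>

bool done = false;  // plain bool: read and written from different threads

void Worker() {
  // Under optimization the compiler may read `done` once and then loop
  // forever, since the race makes the cross-thread write invisible to it.
  while (!done) {
  }
}

int main() {
  std::thread t(Worker);
  done = true;  // racy write; the worker may never observe it
  t.join();     // can hang under optimization
  return 0;
}
```

Declaring the flag as std::atomic<bool> (or guarding it with a mutex) removes both the undefined behavior and the hang.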
Rating
If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or to indicate this in the bug. This makes it much easier for us to assess the actual impact of these reports and whether they are helpful to you.
False Positives / Benign Races
Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix, if that is easily doable and does not regress performance. Every race that we cannot fix has to remain on the suppression list, which slows down overall TSan performance. Also note that seemingly benign races can be harmful, depending on the compiler, its optimizations, and the architecture [2][3].
[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Suppressing unfixable races
If the bug cannot be fixed, a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. Suppressions match on the full stack, so the entry should be chosen so that it is unique to this particular race. The number of this bug should also be included, so we have some documentation of why the suppression was added.
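For illustration, suppressions use TSan's race: syntax; a sketch of what an entry for this report might look like (the bug number is a placeholder, and the exact structure of TsanOptions.cpp may differ):

```cpp
// Bug NNNNNNN - data race nsTimerImpl::Shutdown vs. nsTimerImpl::CancelImpl
"race:nsTimerImpl::Shutdown\n"
```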
Comment 1•4 years ago
I think I've seen a TSan bug before for this timer shutdown race, but maybe not.
We do run this code in non-debug builds in the parent process.
Comment 2•4 years ago
I think bug 1647417 is the other bug I was thinking of.
Reporter
Comment 4•4 years ago
Good catch, thanks!
Reporter
Comment 5•4 years ago
Depends on D100267
Updated•4 years ago
Reporter
Updated•4 years ago
Comment 7•4 years ago
bugherder
Reporter
Updated•4 years ago
Assignee
Comment 8•4 years ago
It looks like this race may be involved in some recent shutdown hangs. I'm preparing a patch for this, but given how we use gThread, any fix is nontrivial. I'm trying to wrap the thread itself in a threadsafe event target, but I'm not sure how to avoid races late into shutdown. Maybe this theoretical wrapper class should shut down gThread with timer shutdown, but the wrapper itself shouldn't be dereferenced until after we shut down nsThreadManager, in case of late timer checks.
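A rough sketch of the wrapper idea described above, using a plain std::mutex for brevity (the class name, members, and structure are hypothetical, not the landed patch):

```cpp
#include <mutex>

class TimerThread;  // the real timer thread type

// Hypothetical wrapper that serializes every access to the timer thread.
class TimerThreadWrapper {
 public:
  // Called during timer shutdown; the wrapper itself outlives gThread and
  // is only torn down after nsThreadManager shutdown.
  void Shutdown() {
    std::lock_guard<std::mutex> lock(mMutex);
    // ... shut down and release the real thread under the lock ...
    mThread = nullptr;
  }

  // Safe from any thread, even late in shutdown: once the thread is gone,
  // cancellation becomes a no-op instead of a use-after-free.
  void CancelTimer() {
    std::lock_guard<std::mutex> lock(mMutex);
    if (mThread) {
      // ... forward the cancellation to the real thread ...
    }
  }

 private:
  std::mutex mMutex;
  TimerThread* mThread = nullptr;
};
```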
Assignee
Comment 9•3 years ago
Note that this work needs to be rebased on top of Bug 1711090.
This particular race is a tricky one - there's no perfect solution to protecting the timer thread from being called in Cancel while being dereferenced. This patch ensures that we won't run into that problem by locking all of our TimerThread calls behind a mutex inside a wrapper class. Then we hold onto the wrapper class until after we shut down nsThreadManager, at which point no background thread pools should be active anymore to call nsTimerImpl::Cancel.
For reference, the worst-case scenario happens when we dereference gThread between these two calls, in which case we will get stuck on a dereferenced mutex. By putting both the check and the actual call into gThread behind a mutex, maybe we can prevent this issue.
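In other words, the dangerous window is the classic check-then-act gap; a simplified before/after sketch (gThreadMutex is a hypothetical name):

```cpp
// Racy: another thread can shut down and free gThread between the
// null check and the call.
if (gThread) {
  gThread->CancelTimer();  // may touch a freed object
}

// Guarded: check and call form one critical section, so shutdown
// (which takes the same mutex before clearing gThread) cannot interleave.
{
  std::lock_guard<std::mutex> lock(gThreadMutex);  // hypothetical mutex
  if (gThread) {
    gThread->CancelTimer();
  }
}
```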
This may not be the most elegant solution, so if there is a better way to fix this then it is more than welcome.
Assignee
Comment 10•3 years ago
Try push shows that the issue is fixed with the above patch in the stack: https://treeherder.mozilla.org/jobs?repo=try&revision=a9e9135464b2750a8b6fd2a19408e3537b39dbe2
Updated•3 years ago
Comment 11•3 years ago
Comment 12•3 years ago
bugherder
Comment 13•3 years ago
Is there more work to do here, or can we remove the leave-open and resolve?
Assignee
Comment 14•3 years ago
(In reply to Julien Cristau [:jcristau] from comment #13)
> Is there more work to do here, or can we remove the leave-open and resolve?
Looks like we're not running into this intermittent anymore after I removed the suppression, so I'll resolve it.
Updated•3 years ago