ThreadSanitizer: data race [@ get] vs. [@ assign_assuming_AddRef] ([@ CurrentDriver] vs. [@ mozilla::MediaTrackGraphImpl::SetCurrentDriver])
Categories
(Core :: Audio/Video, defect, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr68 | --- | unaffected |
firefox72 | --- | unaffected |
firefox73 | --- | fixed |
firefox74 | --- | fixed |
People
(Reporter: decoder, Assigned: padenot)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: regression)
Crash Data
Attachments
(2 files)
22.96 KB,
text/plain
|
Details | |
47 bytes,
text/x-phabricator-request
|
jcristau
:
approval-mozilla-beta+
|
Details | Review |
The attached crash information was detected while running CI tests with ThreadSanitizer on mozilla-central revision 12fb5e522dd3.
A quick look at the code seems to indicate that this is unexpected: Both CurrentDriver
and mozilla::MediaTrackGraphImpl::SetCurrentDriver
have a comment on them that the monitor lock needs to be held, which apparently is not the case here.
General information about TSan reports
Why fix races?
Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.
Rating
If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.
False Positives / Benign Races
Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].
[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Suppressing unfixable races
If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp
. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.
Reporter | ||
Comment 1•5 years ago
|
||
Comment 2•5 years ago
|
||
The access of the mDriver
must happen with the monitor held if it is not in the graph (audio) thread or the audio thread is not running. CurrentDriver()
method, on debug, verifies that. We need to suppress it or to update the comment if that fixes the report.
Reporter | ||
Comment 3•5 years ago
|
||
(In reply to Alex Chronopoulos [:achronop] from comment #2)
The access of the
mDriver
must happen with the monitor held if it is not in the graph (audio) thread or the audio thread is not running.CurrentDriver()
method, on debug, verifies that. We need to suppress it or to update the comment if that fixes the report.
Please note that the report is still valid, we are concurrently setting and reading mDriver
at the same time from two different threads (main thread and MediaGraphThread). Just ignoring this race is likely not the right way to move forward here, unless the mDriver
update was not effectful (overwriting mDriver
with the same value that it already had).
Comment 4•5 years ago
|
||
Hmm the way I read it is that we access the mDriver
(through CurrentDriver()
) which has been previously written by T1 ('MediaTrackGrph'). The tricky point is that it does not happen in parallel. Actually, at that point, the T1 thread has been closed and deleted. This is by design. The way that the code works allows access to mDriver
without the mutex held if the second thread (T1) is closed. The code in debug has several checks to verify that like the one I linked above. That is why I believe is safe. In any case, I can be missing something. I am pulling in Paul for another pair of eyes.
Reporter | ||
Comment 5•5 years ago
|
||
(In reply to Alex Chronopoulos [:achronop] from comment #4)
Hmm the way I read it is that we access the
mDriver
(throughCurrentDriver()
) which has been previously written by T1 ('MediaTrackGrph'). The tricky point is that it does not happen in parallel. Actually, at that point, the T1 thread has been closed and deleted.
TSan tracks the life-cycle of all threads and the thread T1 is still alive and running according to TSan when the access happens.
Assignee | ||
Comment 6•5 years ago
|
||
The issue is that the write is not synchronized. The read is done locked, but not the write.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ccc34d4c68dcfe7cb63e9a55ff029a47a013db1b
Assignee | ||
Comment 7•5 years ago
|
||
Updated•5 years ago
|
Updated•5 years ago
|
Adding priority to remove from triage.
Comment 10•5 years ago
|
||
bugherder |
Comment 12•5 years ago
|
||
Updated•5 years ago
|
Updated•5 years ago
|
Comment 13•5 years ago
|
||
Comment on attachment 9121836 [details]
Bug 1609400 - Lock when setting a graph's current driver. r?achronop
Beta/Release Uplift Approval Request
- User impact if declined: Anything.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This makes behavior more similar to how it was prior to the change that caused the regression. There is a risk that this somehow interferes with new code, but I've looked over the usage in current code and I don't see any significant differences on beta.
- String changes made/needed: None.
Updated•5 years ago
|
Updated•5 years ago
|
Comment 15•5 years ago
|
||
Comment on attachment 9121836 [details]
Bug 1609400 - Lock when setting a graph's current driver. r?achronop
approved for 73.0b12, thanks
Comment 16•5 years ago
|
||
bugherder uplift |
Updated•5 years ago
|
Description
•