Closed Bug 1609400 Opened 5 years ago Closed 5 years ago

ThreadSanitizer: data race [@ get] vs. [@ assign_assuming_AddRef] ([@ CurrentDriver] vs. [@ mozilla::MediaTrackGraphImpl::SetCurrentDriver])

Categories

(Core :: Audio/Video, defect, P2)

x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla74
Tracking Status
firefox-esr68 --- unaffected
firefox72 --- unaffected
firefox73 --- fixed
firefox74 --- fixed

People

(Reporter: decoder, Assigned: padenot)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression)

Crash Data

Attachments

(2 files)

The attached crash information was detected while running CI tests with ThreadSanitizer on mozilla-central revision 12fb5e522dd3.

A quick look at the code seems to indicate that this is unexpected: Both CurrentDriver and mozilla::MediaTrackGraphImpl::SetCurrentDriver have a comment on them that the monitor lock needs to be held, which apparently is not the case here.

General information about TSan reports

Why fix races?

Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.

Rating

If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.

False Positives / Benign Races

Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].

[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf

Suppressing unfixable races

If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.

The access of the mDriver must happen with the monitor held if it is not in the graph (audio) thread or the audio thread is not running. CurrentDriver() method, on debug, verifies that. We need to suppress it or to update the comment if that fixes the report.

(In reply to Alex Chronopoulos [:achronop] from comment #2)

The access of the mDriver must happen with the monitor held if it is not in the graph (audio) thread or the audio thread is not running. CurrentDriver() method, on debug, verifies that. We need to suppress it or to update the comment if that fixes the report.

Please note that the report is still valid, we are concurrently setting and reading mDriver at the same time from two different threads (main thread and MediaGraphThread). Just ignoring this race is likely not the right way to move forward here, unless the mDriver update was not effectful (overwriting mDriver with the same value that it already had).

Hmm the way I read it is that we access the mDriver (through CurrentDriver()) which has been previously written by T1 ('MediaTrackGrph'). The tricky point is that it does not happen in parallel. Actually, at that point, the T1 thread has been closed and deleted. This is by design. The way that the code works allows access to mDriver without the mutex held if the second thread (T1) is closed. The code in debug has several checks to verify that like the one I linked above. That is why I believe is safe. In any case, I can be missing something. I am pulling in Paul for another pair of eyes.

Flags: needinfo?(padenot)

(In reply to Alex Chronopoulos [:achronop] from comment #4)

Hmm the way I read it is that we access the mDriver (through CurrentDriver()) which has been previously written by T1 ('MediaTrackGrph'). The tricky point is that it does not happen in parallel. Actually, at that point, the T1 thread has been closed and deleted.

TSan tracks the life-cycle of all threads and the thread T1 is still alive and running according to TSan when the access happens.

The issue is that the write is not synchronized. The read is done locked, but not the write.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ccc34d4c68dcfe7cb63e9a55ff029a47a013db1b

Assignee: nobody → padenot
Flags: needinfo?(padenot)
Attachment #9121836 - Attachment is obsolete: true
Attachment #9121836 - Attachment is obsolete: false
Regressions: 1586370

Adding priority to remove from triage.

Priority: -- → P2
Pushed by padenot@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/287cc842571b Lock when setting a graph's current driver. r=achronop
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla74
Regressed by: 1586370
No longer regressions: 1586370
Has Regression Range: --- → yes

Comment on attachment 9121836 [details]
Bug 1609400 - Lock when setting a graph's current driver. r?achronop

Beta/Release Uplift Approval Request

  • User impact if declined: Anything.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This makes behavior more similar to how it was prior to the change that caused the regression. There is a risk that this somehow interferes with new code, but I've looked over the usage in current code and I don't see any significant differences on beta.
  • String changes made/needed: None.
Attachment #9121836 - Flags: approval-mozilla-beta?
Crash Signature: [@ mozilla::detail::ConditionVariableImpl::notify_one()]

Comment on attachment 9121836 [details]
Bug 1609400 - Lock when setting a graph's current driver. r?achronop

approved for 73.0b12, thanks

Attachment #9121836 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Crash Signature: [@ mozilla::detail::ConditionVariableImpl::notify_one()] → [@ mozilla::detail::ConditionVariableImpl::notify_one()] [@ mozilla::SourceMediaTrack::End()]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: