Closed Bug 1614605 Opened 2 months ago Closed 1 month ago

ThreadSanitizer: lock-order-inversion (potential deadlock) [@ PR_Lock] in SanctionsTestServer

Categories

(Core :: Security: PSM, defect, P1)

x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla75
Tracking Status
firefox74 --- wontfix
firefox75 --- fixed

People

(Reporter: decoder, Assigned: decoder)

References

(Blocks 1 open bug)

Details

(Whiteboard: [psm-assigned])

Attachments

(2 files)

The attached crash information was detected while running CI tests with ThreadSanitizer on mozilla-central revision ee653474267a.

This potential deadlock report refers to the SanctionsTestServer used in one XPCShell test. Filing this so we are aware of it and I will suppress the issue for now.

General information about TSan reports

Why fix races?

Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.

Rating

If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.

False Positives / Benign Races

Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].

[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf

Suppressing unfixable races

If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.

It looks like OCSPStaplingServer has the same problem and the test using that also has intermittent failures (not sure if these are related).

This is basically that DoSNISocketConfig in both utilities needs mutual exclusion -- at least as ConfigSecureServerWithNamedCert is designed, since it's reusing the internal slot.

I'm pretty sure this is a false positive. These servers only run on a single thread (the other thread just polls the parent to see if it has died - it doesn't touch any NSS stuff). Going by comment 0 this warning has already been suppressed, so I'm going to close this wfm.

Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → WORKSFORME

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #4)

I'm pretty sure this is a false positive. These servers only run on a single thread (the other thread just polls the parent to see if it has died - it doesn't touch any NSS stuff). Going by comment 0 this warning has already been suppressed, so I'm going to close this wfm.

TSan does not produce these types of false positive to our knowledge. Suppressions are temporary to get the tests to work but they are not a permanent solution (they cost performance).

:jcj seems to have found the problem in comment 3 and that should be addressed.

If you still believe this to be a false positive, then we need to find the source of the false positive and fix it, because we can't afford this tool to produce false positives in the first place. We can also run this test with TSAN_OPTIONS=second_deadlock_stack=1 for a better report.

Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---

Dana's right of course, this is single-threaded, so even with lock-order inversion it's not a race. Besides which there's no race in the Firefox code, this is a mochitest service which was identified.

Mutual exclusion into DoSNISocketConfig could still be done, but it's literally single-threaded, so it'd just be to satisfy TSAN.

I agree, re-resolving WORKSFORME.

Status: REOPENED → RESOLVED
Closed: 2 months ago2 months ago
Resolution: --- → WORKSFORME

For future reference, this appears to be a known issue: https://github.com/google/sanitizers/issues/488 (thanks :kjacobs!)

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #7)

For future reference, this appears to be a known issue: https://github.com/google/sanitizers/issues/488 (thanks :kjacobs!)

Thanks for figuring this out, this is valuable for us to know! I will also reach out to Dmitry and see if there is any progress on the issue.

(In reply to J.C. Jones [:jcj] (he/him) from comment #6)

I agree, re-resolving WORKSFORME.

We still can't just close the bug. Suppressions are temporary unless they are marked explicitly permanent with a reason (otherwise, automating which suppressions are obsolete becomes challenging). I will make the necessary changes and flag you for review.

Assignee: nobody → choller
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---

The priority flag is not set for this bug.
:keeler, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(dkeeler)
Flags: needinfo?(dkeeler)
Priority: -- → P1
Whiteboard: [psm-assigned]
Pushed by choller@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8924ccb67af0
Add permanent deadlock suppressions for single thread. r=keeler
Status: REOPENED → RESOLVED
Closed: 2 months ago1 month ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla75
You need to log in before you can comment on or make changes to this bug.