Open Bug 1612212 Opened 4 years ago Updated 2 years ago

Write a test for TSan race detection on RefPtr/nsCOMPtr referring thread-safe objects

Categories

(Core :: Fuzzing, task)

task

Tracking

()

People

(Reporter: mayhemer, Assigned: decoder)

References

(Blocks 1 open bug)

Details

We want to make sure that TSan can detect races on smart pointers referring objects with otherwise thread-safe ref counters.

I am not sure what this bug is about. Why is it necessary to do anything specific about this? The reference counter being thread-safe/atomic is unrelated to a RefPtr variable being thread-safe/atomic. In my understanding, the TSan would detect unsynchronized write/read or write/write races on a RefPtr variable, as with any other memory access.

We are not sure if the release/acquire read/write on the ref counter is not considered by TSan as sufficient-enough synchronization for the RefPtr::mRawPtr member too. It's likely that that will be detected as you write, we just want to make sure and also have a regression-like test when we change some login in or around our smart pointers.

Thanks for the clarification! Having a regression test for that makes sense to me as well, as it is a critical point.

I've implemented a proof-of-concept GTest and confirmed that the problem is detected as expected:

#include "nsISupportsImpl.h"

class RefCounted {
    NS_INLINE_DECL_THREADSAFE_REFCOUNTING(RefCounted)

public:
    RefCounted(const std::string& name) : mName(name) {}

    void PrintName(const std::string& who) {
      fprintf(stderr, "%s: %s\n", who.c_str(), mName.c_str());
    }

private:
    ~RefCounted() = default;

    std::string mName;
};

RefPtr<RefCounted> gRefPtr;

void *RacyThread(void* arg) {
    RefPtr<RefCounted> local = gRefPtr; // Line 22 - Race report expected here.
    local->PrintName("Racing Thread");
    return nullptr;
}

TEST(race_refptr, races) {
    gRefPtr = new RefCounted("Object1");

    pthread_t racyThread;
    if (pthread_create(&racyThread, nullptr, RacyThread, nullptr)) {
      MOZ_CRASH("Error creating thread");
    }

    gRefPtr = new RefCounted("Object2"); // Line 35 - Race report expected here.
    gRefPtr->PrintName("Main Thread");

    int *ret = 0;
    pthread_join(racyThread, (void**)&ret);
}

And running that, we get:

WARNING: ThreadSanitizer: data race (pid=1288)
  Write of size 8 at 0x7f31e6a68040 by main thread:
    #0 assign_assuming_AddRef objdir/dist/include/mozilla/RefPtr.h:67:13
    #1 assign_with_AddRef objdir/dist/include/mozilla/RefPtr.h:62:5
    #2 operator= objdir/dist/include/mozilla/RefPtr.h:190:5
    #3 race_refptr_races_Test::TestBody() mfbt/tests/gtest/TestRace.cpp:35:13
    #4 HandleExceptionsInMethodIfSupported<testing::Test, void> testing/gtest/gtest/src/gtest.cc
    #5 testing::Test::Run() testing/gtest/gtest/src/gtest.cc:2519:5
    [...]
    #18 main browser/app/nsBrowserApp.cpp:339:16

  Previous read of size 8 at 0x7f31e6a68040 by thread T15:
    #0 RefPtr objdir/dist/include/mozilla/RefPtr.h:93:27
    #1 RacyThread(void*) mfbt/tests/gtest/TestRace.cpp:22:32

  Location is global 'gRefPtr' of size 8 at 0x7f31e6a68040

  Thread T15 (tid=1323, running) created by main thread at:
    #0 pthread_create <null>
    #1 race_refptr_races_Test::TestBody() mfbt/tests/gtest/TestRace.cpp:31:9
    #2 HandleExceptionsInMethodIfSupported<testing::Test, void> testing/gtest/gtest/src/gtest.cc
    [...]
    #16 main browser/app/nsBrowserApp.cpp:339:16

SUMMARY: ThreadSanitizer: data race objdir/dist/include/mozilla/RefPtr.h:67:13 in assign_assuming_AddRef

which is exactly the race we have been expecting. Honza, does that look right to you?

Right now, the test passes nonetheless because the GTest framework does not seem to handle the TSan abort correctly. We use abort_on_error=1 by default, which aborts at the end of the process lifetime if a race was detected. It can be combined with halt_on_error=1 to abort immediately. However, none of these options seem to properly work with GTests. If we wanted this to be a true regression test, we probably also would want to parse the stack and ensure that what is reported matches somewhat our expectations (or at least the summary).

Blocks: tsan
Flags: needinfo?(honzab.moz)

(In reply to Christian Holler (:decoder) from comment #4)

which is exactly the race we have been expecting. Honza, does that look right to you?

Yes, excellent!

Right now, the test passes nonetheless because the GTest framework does not seem to handle the TSan abort correctly. We use abort_on_error=1 by default, which aborts at the end of the process lifetime if a race was detected. It can be combined with halt_on_error=1 to abort immediately. However, none of these options seem to properly work with GTests. If we wanted this to be a true regression test, we probably also would want to parse the stack and ensure that what is reported matches somewhat our expectations (or at least the summary).

Hmm.. as this is a really nasty one, we should somehow make this actually fallible. Not sure if this should be done in a followup or here. Note that we also have a notion of C++ tests, like https://searchfox.org/mozilla-central/search?q=TestAtomics&case=true&regexp=false&path=. Maybe if you rewrite your test as a CppTest it will work better with TSan abortions?

Thanks again!

Flags: needinfo?(honzab.moz) → needinfo?(choller)

(In reply to Honza Bambas (:mayhemer) from comment #5)

(In reply to Christian Holler (:decoder) from comment #4)

which is exactly the race we have been expecting. Honza, does that look right to you?

Yes, excellent!

Right now, the test passes nonetheless because the GTest framework does not seem to handle the TSan abort correctly. We use abort_on_error=1 by default, which aborts at the end of the process lifetime if a race was detected. It can be combined with halt_on_error=1 to abort immediately. However, none of these options seem to properly work with GTests. If we wanted this to be a true regression test, we probably also would want to parse the stack and ensure that what is reported matches somewhat our expectations (or at least the summary).

Hmm.. as this is a really nasty one, we should somehow make this actually fallible. Not sure if this should be done in a followup or here. Note that we also have a notion of C++ tests, like https://searchfox.org/mozilla-central/search?q=TestAtomics&case=true&regexp=false&path=. Maybe if you rewrite your test as a CppTest it will work better with TSan abortions?

This might be a solution for this particular test, but I guess we would want to be able to run all GTests with TSan and detect failures.

(In reply to Simon Giesecke [:sg] [he/him] from comment #6)

This might be a solution for this particular test, but I guess we would want to be able to run all GTests with TSan and detect failures.

It would be great if we could, but from what I remember, GTests have a ton of "test-only" races because they can spawn threads directly and some tests do so. Fixing a ton of test-only bugs just to get GTests running (which don't really reflect how threading in Firefox is used anyway) doesn't sound like an investment that would want to prioritize right now.

In any case, I will try to find an owner for this bug and we can decide based on how much work it would be to get this particular test implemented without running all of the gtests.

Flags: needinfo?(choller)
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.