Closed Bug 558402 Opened 12 years ago Closed 12 years ago

hellgrind imagines a vtable-overwrite race in nsThread::Init vs nsThread::ThreadFunc

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

VERIFIED INVALID

People

(Reporter: jseward, Unassigned)

Details

(I'm not 100% sure of this, but I can't think of a better explanation
for what I'm seeing.)



SUMMARY: nsThread::Init and nsThread::ThreadFunc can get into a
situation where one thread is reading the vtable pointer of an object
whilst another is overwriting it.

For a short background on vtable-overwrite races, see
https://bugzilla.mozilla.org/show_bug.cgi?id=551155#c6

All line numbers in xpcom/threads/nsThread.cpp, mozilla-central



SCENARIO LEADING TO RACE

Initially there is just one thread, call it Parent.

Parent enters nsThread::Init (nsThread.cpp:310)

It creates a nsRefPtr<nsThreadStartupEvent>:

315    nsRefPtr<nsThreadStartupEvent> startup = nsThreadStartupEvent::Create();

It then creates a child thread:

323  PRThread *thr = PR_CreateThread(PR_USER_THREAD, ThreadFunc, this,
324                                  PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
325                                  PR_JOINABLE_THREAD, 0);

The child thread somehow or other has access to the |startup| object
(doesn't matter how).  Note that it is called |event| in the child.

Parent proceeds to wait for child to signal to it via |startup|:

341  startup->Wait();

        Child is running in nsThread::ThreadFunc.

        It sends the signal the parent is waiting for, via |event|
        a.k.a |startup|

        249  event->Run();  // unblocks nsThread::Init

Now both parent and child can run in parallel, arbitrary interleaving.

        Child nulls out |event|, involving virtual method calls on
        the object via nsCOMPtr<nsIRunnable>::operator=(nsIRunnable*)

        250 event = nsnull;

Meanwhile parent reaches end of nsThread::Init.  It calls
nsRefPtr<nsThreadStartupEvent>::~nsRefPtr() on |startup|.  This call
isn't visible in the source, but is inserted by the compiler because
that's the end of the scope of auto var |startup|.

Because nsThreadStartupEvent has a virtual destructor, the destructor
overwrites the vtable pointer.  This is unsynchronised relative to the
child's virtual calls on the same object via "event = nsnull".


A couple of sanity checks:

* perhaps there isn't really a race, and Helgrind complains because it
  doesn't see the inter-thread synchronisation event caused by the
  Run() .. Wait() pair on the object.

  To rule this out I used the ANNOTATE_HAPPENS_BEFORE/AFTER client
  requests to artificially add a dependence edge (as viewed by
  Helgrind).  Result: no change, race still reported.

* analysis above implies that commenting out the "event = nsnull;"
  line should make the race disappear.  And so it does.


Basic issue is, once the child signals at the parent via event->Run(),
it cannot safely touch |event| in any way.
Here's the race report.  It contains a lot of info:

* creation points of the two threads involved
* a stack for the writing thread
* a stack for the reading thred
* a stack indicating where the raced-on block was allocated
* the value that the writing thread is about to overwrite


Thread #1 is the program's root thread

Thread #2 was created
   at 0x67EF7CE: clone (clone.S:77)
   by 0x4E34013: T.102 (createthread.c:75)
   by 0x4E34FFC: pthread_create@@GLIBC_2.2.5 (createthread.c:249)
   by 0x4C2A4AC: pthread_create_WRK (hg_intercepts.c:241)
   by 0x4C2A54F: pthread_create@* (hg_intercepts.c:268)
   by 0x5B4EC7D: _PR_CreateThread (ptthread.c:471)
   by 0x5B4EDDE: PR_CreateThread (ptthread.c:554)
   by 0x52C6C60: nsThread::Init() (nsThread.cpp:325)
   by 0x52C7ADC: nsThreadManager::NewThread(unsigned int, nsIThread**)
                 (nsThreadManager.cpp:245)
   by 0x528DD62: NS_NewThread_P(nsIThread**, nsIRunnable*)
                 (nsThreadUtils.cpp:74)
   by 0x400D86: TestThreads() (TestThreads.cpp:86)
   by 0x4013D4: main (TestThreads.cpp:249)

Possible data race during write of size 8 at 0xc6ba1b0 by thread #1
   at 0x52C76C3: nsThreadStartupEvent::~nsThreadStartupEvent()
                 (nsThread.cpp:173)
   by 0x528DED3: nsRunnable::Release() (nsThreadUtils.cpp:55)
   by 0x52C752D: nsRefPtr<nsThreadStartupEvent>::~nsRefPtr()
                 (nsAutoPtr.h:969)
   by 0x52C6CB8: nsThread::Init() (nsThread.cpp:342)
   by 0x52C7ADC: nsThreadManager::NewThread(unsigned int, nsIThread**)
                 (nsThreadManager.cpp:245)
   by 0x528DD62: NS_NewThread_P(nsIThread**, nsIRunnable*)
                 (nsThreadUtils.cpp:74)
   by 0x400D86: TestThreads() (TestThreads.cpp:86)
   by 0x4013D4: main (TestThreads.cpp:249)

 This conflicts with a previous read of size 8 by thread #2
   at 0x5288201: nsCOMPtr_base::assign_assuming_AddRef(nsISupports*)
                 (nsCOMPtr.h:456)
   by 0x528802E: nsCOMPtr_base::assign_with_AddRef(nsISupports*)
                 (nsCOMPtr.cpp:89)
   by 0x52C7640: nsCOMPtr<nsIRunnable>::operator=(nsIRunnable*)
                 (nsCOMPtr.h:640)
   by 0x52C6D2A: nsThread::ThreadFunc(void*) (nsThread.cpp:250)
   by 0x5B4E8EA: _pt_root (ptthread.c:230)
   by 0x4C2A5D8: mythread_wrapper (hg_intercepts.c:213)
   by 0x4E34A03: start_thread (pthread_create.c:300)
   by 0x67EF80C: clone (clone.S:112)

 Address 0xc6ba1b0 is 0 bytes inside a block of size 32 alloc'd
   at 0x4C25B08: malloc (vg_replace_malloc.c:236)
   by 0x551EF75: moz_xmalloc (mozalloc.cpp:75)
   by 0x52C765E: nsThreadStartupEvent::Create() (mozalloc.h:222)
   by 0x52C6BFD: nsThread::Init() (nsThread.cpp:315)
   by 0x52C7ADC: nsThreadManager::NewThread(unsigned int, nsIThread**)
                 (nsThreadManager.cpp:245)
   by 0x528DD62: NS_NewThread_P(nsIThread**, nsIRunnable*)
                 (nsThreadUtils.cpp:74)
   by 0x400D86: TestThreads() (TestThreads.cpp:86)
   by 0x4013D4: main (TestThreads.cpp:249)

 The old value at address 0xc6ba1b0 was 0x5517d30
   and 0x5517d30 is symbol "vtable for nsThreadStartupEvent+16"
It seems like there's a second race of a similar form, to do with
handling of the nsThreadShutdownAckEvent object used to coordinate
thread shutdowns:

Child is towards the end of nsThread::ThreadFunc.  It creates the
nsThreadShutdownAckEvent object ..

279   // Dispatch shutdown ACK
280   event = new nsThreadShutdownAckEvent(self->mShutdownContext);

.. signals at the parent that it may proceed, afaics by writing
PR_TRUE to mShutdownContext->shutdownAck ..

281   self->mShutdownContext->joiningThread->Dispatch(event,
NS_DISPATCH_NORMAL);
282
283   // Release any observer of the thread here.
284   self->SetObserver(nsnull);
285
286   NS_RELEASE(self);

.. and the compiler inserts a call here to
nsCOMPtr<nsIRunnable>::~nsCOMPtr() to auto-destroy |event|.

287 }


Meanwhile the parent repeatedly calls NS_ProcessNextEvent until it
sees the write of .shutdownAck from the child.

466  // Process events on the current thread until we receive a shutdown ACK.
467  while (!context.shutdownAck)
468    NS_ProcessNextEvent(context.joiningThread);

Each call to NS_ProcessNextEvent calls through to
nsThread::ProcessNextEvent(PRBool mayWait, PRBool *result).  That
processes each event with the following code:

514  {
515    // Scope for |event| to make sure that its destructor fires while
516    // mRunningEvent has been incremented, since that destructor can
517    // also do work.
518
519    // If we are shutting down, then do not wait for new events.
520    nsCOMPtr<nsIRunnable> event;
521    mEvents->GetEvent(mayWait && !ShuttingDown(), getter_AddRefs(event));
522
523    *result = (event.get() != nsnull);
524
525    if (event) {
526      LOG(("THRD(%p) running [%p]\n", this, event.get()));
527      event->Run();
528    } else if (mayWait) {
529      NS_ASSERTION(ShuttingDown(),
530                   "This should only happen when shutting down");
531      rv = NS_ERROR_UNEXPECTED;
532    }
533  }

At 532, end of the scope for |event|, it is auto-destructed.  But
that's the same |event| that was auto-destructed in the child at line
287.

---------

There's stuff I don't understand about either race.  How is memory
error is avoided (viz, the memory is apparently not deallocated until
both threads have Released it)?  Why does the the second thread to do
the Release wind up overwriting vtable pointers that are read by the
earlier thread, without any apparent coordination?

/me confused.
(In reply to comment #2)
> It seems like there's a second race of a similar form [...]
Report is:

(As above, #1 is the parent and #2 is the child.)


Possible data race during write of size 8 at 0xc739210 by thread #1
   at 0x5302463: nsThreadShutdownAckEvent::~nsThreadShutdownAckEvent()
                 (nsThread.cpp:204)
   by 0x529A41E: nsRunnable::Release() (nsThreadUtils.cpp:55)
   by 0x528EB6B: nsCOMPtr_base::~nsCOMPtr_base() (nsCOMPtr.cpp:81)
   by 0x53020A8: nsCOMPtr<nsIRunnable>::~nsCOMPtr() (nsCOMPtr.h:469)
   by 0x5300767: nsThread::ProcessNextEvent(int, int*) (nsThread.cpp:532)
   by 0x5299FEB: NS_ProcessNextEvent_P(nsIThread*, int)
                 (nsThreadUtils.cpp:250)
   by 0x53017BC: nsThread::Shutdown() (nsThread.cpp:468)
   by 0x401130: TestThreads() (TestThreads.cpp:99)
   by 0x4018E1: main (TestThreads.cpp:249)

 This conflicts with a previous read of size 8 by thread #2
   at 0x528EB59: nsCOMPtr_base::~nsCOMPtr_base() (nsCOMPtr.cpp:81)
   by 0x53020A8: nsCOMPtr<nsIRunnable>::~nsCOMPtr() (nsCOMPtr.h:469)
   by 0x530152E: nsThread::ThreadFunc(void*) (nsThread.cpp:286)
   by 0x5BCA45B: _pt_root (ptthread.c:230)
   by 0x4C2A5D8: mythread_wrapper (hg_intercepts.c:213)
   by 0x4E34A03: start_thread (pthread_create.c:300)
   by 0x686E80C: clone (clone.S:112)

 Address 0xc739210 is 0 bytes inside a block of size 24 alloc'd
   at 0x4C25B08: malloc (vg_replace_malloc.c:236)
   by 0x558E0F3: moz_xmalloc (mozalloc.cpp:75)
   by 0x530147B: nsThread::ThreadFunc(void*) (mozalloc.h:222)
   by 0x5BCA45B: _pt_root (ptthread.c:230)
   by 0x4C2A5D8: mythread_wrapper (hg_intercepts.c:213)
   by 0x4E34A03: start_thread (pthread_create.c:300)
   by 0x686E80C: clone (clone.S:112)

 The old value at address 0xc739210 was 0x5586ed0
   and 0x5586ed0 is symbol "vtable for nsThreadShutdownAckEvent+16"
Status: NEW → UNCONFIRMED
Ever confirmed: false
so, the expectation i had was that hellgrind couldn't understand that our atomicrefcount in _class::Release() guards the delete method.

NS_IMPL_THREADSAFE_RELEASE

It sounds like sewardj has reached the same conclusion, and will add an annotation there to help hellgrind out.
Summary: vtable-overwrite race in nsThread::Init vs nsThread::ThreadFunc → hellgrind imagines a vtable-overwrite race in nsThread::Init vs nsThread::ThreadFunc
Timeless is right.  Helgrind doesn't understand that the thread
causing the 1 -> 0 transition in the reference count becomes the
exclusive owner of the object.  Telling it this is a 1-liner in
NS_IMPL_THREADSAFE_RELEASE, and it makes all these apparent races go
away.

This would have been much quicker to diagnose if I had not had the
nonsensical idea that the destructors run all the way down to some
imagined base class, which does the atomic decrement, and then
possibly releases the memory.  In fact AIUI now, Release first does an
atomic decrement, then if zero runs the destructor(s), if nonzero
takes no action.

Sorry for the noise.  The appropriate annotation will be rolled in to
the patch for bug 551155.
Resolving as Invalid.
Status: UNCONFIRMED → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.