Closed Bug 844784 Opened 11 years ago Closed 10 years ago

TSan: Thread data race in PR_EnterMonitor

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

4.10.4

People

(Reporter: posidron, Assigned: wtc)

References

(Blocks 1 open bug)

Details

(Keywords: sec-want, Whiteboard: [tsan][tsan-test-blocker])

Attachments

(7 files, 7 obsolete files)

trace 11 years ago Christoph Diehl [:posidron] 29.95 KB, text/plain		Details
Patch 11 years ago Wan-Teh Chang 9.94 KB, patch		Details \| Diff \| Splinter Review
Patch v1.1 11 years ago Wan-Teh Chang 9.94 KB, patch	decoder : feedback+	Details \| Diff \| Splinter Review
Patch v2 11 years ago Wan-Teh Chang 9.95 KB, patch	jcranmer : review+	Details \| Diff \| Splinter Review
Patch v3 10 years ago Wan-Teh Chang 15.20 KB, patch	jcranmer : review+	Details \| Diff \| Splinter Review
Patch v4 10 years ago Wan-Teh Chang 15.18 KB, patch	laszio.bugzilla : review+ wtc : checked-in+	Details \| Diff \| Splinter Review
PR_AssertCurrentThreadInMonitor should be a no-op when assertions are disabled. 10 years ago Wan-Teh Chang 845 bytes, patch	laszio.bugzilla : review+ wtc : checked-in+	Details \| Diff \| Splinter Review
Add a reference count to PRMonitor because PR_ExitMonitor may access struct members after unlocking the internal mutex 10 years ago Wan-Teh Chang 6.05 KB, patch	laszio.bugzilla : superreview+	Details \| Diff \| Splinter Review
Add a reference count to PRMonitor because PR_ExitMonitor may access struct members after unlocking the internal mutex, v2 10 years ago Wan-Teh Chang 6.73 KB, patch	jcranmer : review+ laszio.bugzilla : superreview+ wtc : checked-in+	Details \| Diff \| Splinter Review
[For reference only] Patch to add an artificial delay in PR_ExitMonitor for a non-primordial thread 10 years ago Wan-Teh Chang 789 bytes, patch		Details \| Diff \| Splinter Review
Port the new PRMonitor code to Windows 10 years ago Wan-Teh Chang 2.48 KB, patch		Details \| Diff \| Splinter Review
[Correct patch] Port the new PRMonitor code to non-pthreads platforms (e.g., Windows) 10 years ago Wan-Teh Chang 25.07 KB, patch	jcranmer : review+	Details \| Diff \| Splinter Review
Port the new PRMonitor code to non-pthreads platforms (e.g., Windows), v2 10 years ago Wan-Teh Chang 24.96 KB, patch	wtc : checked-in+	Details \| Diff \| Splinter Review
Don't map pthread_mutex_init or pthread_cond_init error to PR_OPERATION_NOT_SUPPORTED_ERROR 10 years ago Wan-Teh Chang 621 bytes, patch	jcranmer : review+ wtc : checked-in+	Details \| Diff \| Splinter Review

Christoph Diehl [:posidron]

Reporter

Description

•

11 years ago

Attached file trace — Details

During Firefox start-up with ThreadSanitizer (LLVM version), we get a data race reported as described in the attached log. Trace was created on mozilla-central with changeset c233837cce08.

According to the TSan devs, most of the reported traces should be real data races, even though they can be "benign". We need to determine if the race can/should be fixed, or put on the ignore list. Even for benign races, TSan devs suggest to fix them (second priority), as they can also cause problems [1].

[1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong

Christian Holler (:decoder)

Comment 1

•

11 years ago

Comment says:

>     * This is safe only if mon->owner (a pthread_t) can be
>     * read in one instruction.  Perhaps mon->owner should be
>     * a "PRThread *"?

Not sure if that means the race is intentional? Maybe it can be fixed even if it is.

Wan-Teh Chang

Assignee

Comment 2

•

11 years ago

Yes, the code was intentionally written this way. I think
it should be fixed.

If "TraceMalloc" is still being used, the fix needs to avoid
calling malloc() in PR_EnterMonitor.  See bug 84035 comment 5
for a description of my last attempt at a fix. Note that we
may consider the less efficient approach (the first approach
in that comment), which I rejected last time.

Christian Holler (:decoder)

Comment 3

•

11 years ago

This is still a source for a good number of very frequent races in TSan.

@wtc: Is there anything actionable we can do here?

Flags: needinfo?(wtc)

Wan-Teh Chang

Assignee

Comment 4

•

11 years ago

Attached patch Patch (obsolete) — Details — Splinter Review

decoder: could you test this patch?

I'd appreciate it if you could also review the patch or suggest a
reviewer. A good reviewer should be very familiar with the pthreads
mutexes and condition variables, and know how Java's "synchronized"
key word and "wait/notify/notifyAll" methods work. (NSPR's PRMonitor
was originally intended to implement those features in the JVM.)

Attachment #818599 - Flags: review?(choller)

Flags: needinfo?(wtc)

Wan-Teh Chang

Assignee

Comment 5

•

11 years ago

Attached patch Patch v1.1 (obsolete) — Details — Splinter Review

The code review tool showed some spaces at the end of lines.
I deleted those spaces.

Attachment #818599 - Attachment is obsolete: true

Attachment #818599 - Flags: review?(choller)

Attachment #818645 - Flags: review?(choller)

Christian Holler (:decoder)

Comment 6

•

11 years ago

Comment on attachment 818645 [details] [diff] [review]
Patch v1.1

I can test the patch, but unfortunately I'm not a peer for NSPR to review it. Ted might be able to review it or suggest the proper person. I'll provide feedback asap on the patch :) Thanks!

Attachment #818645 - Flags: review?(choller) → review?(ted)

Christian Holler (:decoder)

Comment 7

•

11 years ago

Comment on attachment 818645 [details] [diff] [review]
Patch v1.1

I can confirm that this fixes the race for me :)

Attachment #818645 - Flags: feedback+

Wan-Teh Chang

Assignee

Comment 8

•

11 years ago

Comment on attachment 818645 [details] [diff] [review]
Patch v1.1

Review of attachment 818645 [details] [diff] [review]:
-----------------------------------------------------------------

decoder: thank you for testing my patch.

Searching for "pthread_cond_wait" in mozilla-central, I found that pthread_cond_wait
is mainly used in third-party code. So I can't find a suitable Mozilla developer to
review this patch this way. Perhaps we can ask the owner of mfbt/Atomics.h to review
this patch?

::: pr/src/pthreads/ptsynch.c
@@ +467,5 @@
> +    if (0 != rv)
> +    {
> +        pthread_cond_destroy(&mon->entryCV);
> +        pthread_mutex_destroy(&mon->lock);
> +        PR_Free(mon);

The way I free the allocated resources on failure can be improved.
Too much code duplication!

(not currently active) Ted Mielczarek

Comment 9

•

11 years ago

I am definitely not a suitable reviewer. jcranmer is the original author of mfbt/Atomics.h, and froydnj has worked on it a lot, so either of them might be a suitable reviewer. I'd also suggest sewardj, who has a deep understanding of thread synchronization.

Joshua Cranmer [:jcranmer]

Comment 10

•

11 years ago

Comment on attachment 818645 [details] [diff] [review]
Patch v1.1

Review of attachment 818645 [details] [diff] [review]:
-----------------------------------------------------------------

::: pr/src/pthreads/ptsynch.c
@@ -529,5 @@
> -        PR_Lock(&mon->lock);
> -        /* and now I have the lock */
> -        PR_ASSERT(0 == mon->entryCount);
> -        PR_ASSERT(_PT_PTHREAD_THR_HANDLE_IS_INVALID(mon->owner));
> -        _PT_PTHREAD_COPY_THR_HANDLE(self, mon->owner);

So the original fault here looks to be the broken assumption of "all word-sized loads are atomic" and is basically a modification of the double-checked locking (anti)pattern.

It looks to me, on a cursory glance, that the non-pthreads code also has the same race condition (it's just not getting picked up because we don't run thread sanitizers on those platforms), so I wonder if this fix should apply more widely.

Christian Holler (:decoder)

Updated

•

11 years ago

Blocks: tsan

(not currently active) Ted Mielczarek

Comment 11

•

11 years ago

Comment on attachment 818645 [details] [diff] [review]
Patch v1.1

Removing this review request, I suggested three possible reviewers.

Attachment #818645 - Flags: review?(ted)

Wan-Teh Chang

Assignee

Comment 12

•

11 years ago

Comment on attachment 818645 [details] [diff] [review]
Patch v1.1

Review of attachment 818645 [details] [diff] [review]:
-----------------------------------------------------------------

::: pr/src/pthreads/ptsynch.c
@@ -529,5 @@
> -        PR_Lock(&mon->lock);
> -        /* and now I have the lock */
> -        PR_ASSERT(0 == mon->entryCount);
> -        PR_ASSERT(_PT_PTHREAD_THR_HANDLE_IS_INVALID(mon->owner));
> -        _PT_PTHREAD_COPY_THR_HANDLE(self, mon->owner);

Yes, the non-pthreads code in nspr/pr/src/threads/prmon.c has the
same data race in PR_EnterMonitor and can be fixed in a similar way.
I'll do that in a separate patch.

Wan-Teh Chang

Assignee

Comment 13

•

11 years ago

Attached patch Patch v2 (obsolete) — Details — Splinter Review

I eliminated the code duplication during error cleanup in PR_NewMonitor.
This requires the use of goto. Using goto for error cleanup is acceptable
in C code.

jcranmer, could you review this patch? Thank you for your help.

A PRMonitor is a recursive mutex with an implicit condition variable.

The |lock|, |owner|, |entryCV|, and |entryCount| members of PRMonitor
are used to implement the recursive mutex. I use the algorithm described
in Kleiman, Shah, and Smaalders, "Programming with Threads", pp. 259-260.
The boost::recursive_mutex class uses the same algorithm:
http://svn.boost.org/svn/boost/trunk/boost/thread/pthread/recursive_mutex.hpp

It should be easy to review this part of the patch.

The |waitCV| member of PRMonitor is used to implement the implicit condition
variable of a PRMonitor. In general it is not safe to use a recursive
pthread_mutex_t with a pthread_cond_t because pthread_cond_wait may only
unlock the recursive once. This may be why I couldn't find another
implementation of this on the Web. Hopefully it is not hard to review this
part of the patch.

Attachment #818645 - Attachment is obsolete: true

Attachment #821828 - Flags: review?(Pidgeot18)

Wan-Teh Chang

Assignee

Comment 14

•

11 years ago

jcranmer: the Wikipedia article on reentrant mutex (http://en.wikipedia.org/wiki/Reentrant_mutex)
has relevant info for this patch. In particular, the last paragraph describes the "synchronized"
blocks and the wait(), notify(), and notifyAll() methods in Java. PRMonitor was originally used
to implement those features in a JVM.

Ting-Yuan Huang

Comment 15

•

11 years ago

Comment on attachment 821828 [details] [diff] [review]
Patch v2

Review of attachment 821828 [details] [diff] [review]:
-----------------------------------------------------------------

::: pr/src/pthreads/ptsynch.c
@@ +506,5 @@
>  
>  PR_IMPLEMENT(void) PR_AssertCurrentThreadInMonitor(PRMonitor *mon)
>  {
> +    PR_ASSERT(mon->entryCount != 0 &&
> +              pthread_equal(mon->owner, pthread_self()));

Should this and the access to mon->owner in PR_GetMonitorEntryCount() also be protected by mon->lock?

@@ +633,3 @@
>      PR_ASSERT(mon != NULL);
> +    rv = pthread_cond_broadcast(&mon->waitCV);
> +    PR_ASSERT(0 == rv);

The original implementation requires that PR_Notify() and PR_NotifyAll() be called with lock held (at least in debug builds). Because pthread_cond_{signal, broadcase} don't have this limitation, the behavior will be different.

Joshua Cranmer [:jcranmer]

Comment 16

•

11 years ago

Comment on attachment 821828 [details] [diff] [review]
Patch v2

Review of attachment 821828 [details] [diff] [review]:
-----------------------------------------------------------------

From investigating the other implementations, it looks like they also have the underlying race condition and thus need to be fixed. Since we only run TSAN on pthreads-based implementations (I presume), I won't block review on fixing the other implementations.

::: pr/include/private/primpl.h
@@ +1445,5 @@
>  
>  struct PRMonitor {
>      const char* name;           /* monitor name for debugging */
>  #if defined(_PR_PTHREADS)
> +    pthread_mutex_t lock;

I feel like it is worth mentioning that the lock is only held when accessing fields of the PRMonitor here instead of being held while the monitor is entered.

::: pr/src/pthreads/ptsynch.c
@@ +501,3 @@
>      pthread_t self = pthread_self();
>      if (pthread_equal(mon->owner, self))
>          return mon->entryCount;

This method also needs to be protected by the lock (unless no one uses this method, in which case it ought to be deleted).

@@ +506,5 @@
>  
>  PR_IMPLEMENT(void) PR_AssertCurrentThreadInMonitor(PRMonitor *mon)
>  {
> +    PR_ASSERT(mon->entryCount != 0 &&
> +              pthread_equal(mon->owner, pthread_self()));

Echoing what Ting-Yuan says, the entryCount and owner fields need to be protected by the lock as well. (Although, strictly speaking, the data race only occurs if the assert would fire--but since there's a data race in that scenario, the assert may fail to fire).

@@ -595,5 @@
> -    PR_ASSERT(_PT_PTHREAD_MUTEX_IS_LOCKED(mon->lock.mutex));
> -    /* and the entries better be positive */
> -    PR_ASSERT(mon->entryCount > 0);
> -    /* and it better be by us */
> -    PR_ASSERT(pthread_equal(mon->owner, pthread_self()));

You've lost these we're-in-the-monitor-checks, which amount to a change in the API.

Attachment #821828 - Flags: review?(Pidgeot18) → review+

Christian Holler (:decoder)

Comment 17

•

10 years ago

What's the status here? This is one of the most frequent races I'm seeing on TSan.

Flags: needinfo?(wtc)

Wan-Teh Chang

Assignee

Comment 18

•

10 years ago

Attached patch Patch v3 (obsolete) — Details — Splinter Review

Christian: thanks for the reminder. I am terribly sorry that I forgot
to finish this bug. What happened was that I wanted to preserve a
a performance optimization of the original code that signals a
condition variable after unlocking the mutex, but never got around to
doing that.

Joshua and Ting-Yuan: thank you very much for your thoughtful review
comments. I made all the changes you suggested, plus the performance
optimization described above. Please review this new patch. I will
annotate this patch in the next comment.

Attachment #821828 - Attachment is obsolete: true

Attachment #8365617 - Flags: review?(Pidgeot18)

Flags: needinfo?(wtc)

Wan-Teh Chang

Assignee

Comment 19

•

10 years ago

Comment on attachment 8365617 [details] [diff] [review]
Patch v3

Review of attachment 8365617 [details] [diff] [review]:
-----------------------------------------------------------------

::: pr/include/private/primpl.h
@@ +1445,5 @@
>  
>  struct PRMonitor {
>      const char* name;           /* monitor name for debugging */
>  #if defined(_PR_PTHREADS)
> +    PRIntn notifyTimes;         /* number of pending notifies for waitCV */

notifyTimes is only accessed while the current thread owns the monitor,
so it is not really protected by the |lock| field. I will point out where
I take advantage of this.

If you think this point is too subtle, I can also just require notifyTimes
be accessed only when |lock| is held.

::: pr/src/pthreads/ptsynch.c
@@ +188,5 @@
>  }  /* PR_Lock */
>  
>  PR_IMPLEMENT(PRStatus) PR_Unlock(PRLock *lock)
>  {
> +    pthread_t self = pthread_self();

This is just a minor cleanup, to avoid calling pthread_self() twice in
this function.

@@ -312,5 @@
>      notified->length += 1;
> -
> -finished:
> -    PR_ASSERT(PR_TRUE == cvar->lock->locked);
> -    PR_ASSERT(pthread_equal(cvar->lock->owner, pthread_self()));

These two assertions are also present at the beginning of this function.
It is clear that this function doesn't change the conditions asserted,
so we don't need to assert again when exiting the function.

@@ +432,5 @@
> +static void pt_PostNotifyToMonitor(PRMonitor *mon, PRBool broadcast)
> +{
> +    PRIntn rv;
> +
> +    /* Assert that the monitor is owned by us. */

This function is called by both PR_Notify and PR_NotifyAll. So I do
the monitor ownership check here. It does make the monitor ownership
check for PR_Notify and PR_NotifyAll less obvious.

Note: I just realized that I should use the
PR_ASSERT_CURRENT_THREAD_IN_MONITOR macro here.

@@ +443,5 @@
> +
> +    if (broadcast)
> +        mon->notifyTimes = -1;
> +    else if (-1 != mon->notifyTimes)
> +        mon->notifyTimes += 1;

In this function, I am taking advantage of the fact that notifyTimes is
protected by the monitor, so I am modifying notifyTimes after releasing
mon->lock.

This complicates the description of the thread-safe protection of the
fields of a PRMonitor.

Attachment #8365617 - Flags: review?(thuang)

Joshua Cranmer [:jcranmer]

Comment 20

•

10 years ago

Comment on attachment 8365617 [details] [diff] [review]
Patch v3

Review of attachment 8365617 [details] [diff] [review]:
-----------------------------------------------------------------

::: pr/src/pthreads/ptsynch.c
@@ +443,5 @@
> +
> +    if (broadcast)
> +        mon->notifyTimes = -1;
> +    else if (-1 != mon->notifyTimes)
> +        mon->notifyTimes += 1;

I feel like a comment to this effect in the code would be worthwhile.

Attachment #8365617 - Flags: review?(Pidgeot18) → review+

Wan-Teh Chang

Assignee

Comment 21

•

10 years ago

Attached patch Patch v4 — Details — Splinter Review

Joshua: thanks for the review. I added the comment that you suggested.

I am a little worried that ThreadSanitizer may flag the modifications
of mon->notifyTimes in pt_PostNotifyToMonitor as not thread-safe
because mon->notifyTimes is modified in other places with mon->lock
acquired. mon->notifyTimes is only accessed by the owner of PRMonitor,
so different threads' accesses to mon->notifyTimes will be separated
by a transfer of PRMonitor ownership. Hopefully ThreadSanitizer will
consider this thread-safe.

Patch checked in: https://hg.mozilla.org/projects/nspr/rev/b15b05fafb4f

Attachment #8365617 - Attachment is obsolete: true

Attachment #8365617 - Flags: review?(thuang)

Attachment #8367092 - Flags: review?(thuang)

Attachment #8367092 - Flags: checked-in+

Wan-Teh Chang

Assignee

Updated

•

10 years ago

Status: NEW → ASSIGNED

Priority: -- → P1

Target Milestone: --- → 4.10.4

Ting-Yuan Huang

Comment 22

•

10 years ago

Comment on attachment 8367092 [details] [diff] [review]
Patch v4

Review of attachment 8367092 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me except for PR_Wait() with timeout.

::: pr/src/pthreads/ptsynch.c
@@ +558,5 @@
> +    rv = pthread_mutex_lock(&mon->lock);
> +    PR_ASSERT(0 == rv);
> +    PR_ASSERT(mon->entryCount != 0 &&
> +              pthread_equal(mon->owner, pthread_self()));
> +    rv = pthread_mutex_unlock(&mon->lock);

In normal/non-debug build, the net effect is to lock and immediately unlock. How about embracing the lock/unlock together with the assertions by #ifdef DEBUG ? Put the lock/unlock into PR_ASSERT() also make sense to me. I'm not sure which style is preferred in this code base.

@@ +672,5 @@
> +    if (timeout == PR_INTERVAL_NO_TIMEOUT)
> +        rv = pthread_cond_wait(&mon->waitCV, &mon->lock);
> +    else
> +        rv = pt_TimedWait(&mon->waitCV, &mon->lock, timeout);
> +    PR_ASSERT(0 == rv);

Should a timed-out wait be regarded as fatal? In the original implementation (and other implementations I've seen), PR_Wait() returns E_XXX_YYY after acquiring the lock (not mon->lock in this patch, I mean the recursive mutex implemented with entryCV).

Sorry I missed this in previous review.

Attachment #8367092 - Flags: review?(thuang) → review-

Wan-Teh Chang

Assignee

Comment 23

•

10 years ago

Comment on attachment 8367092 [details] [diff] [review]
Patch v4

Review of attachment 8367092 [details] [diff] [review]:
-----------------------------------------------------------------

Ting-Yuan: thanks a lot for your comments. I responded to them below.

::: pr/src/pthreads/ptsynch.c
@@ +558,5 @@
> +    rv = pthread_mutex_lock(&mon->lock);
> +    PR_ASSERT(0 == rv);
> +    PR_ASSERT(mon->entryCount != 0 &&
> +              pthread_equal(mon->owner, pthread_self()));
> +    rv = pthread_mutex_unlock(&mon->lock);

The problem you pointed out in a non-debug build is acceptable, because people are not supposed to call this function directly. Here is the documentation from the prmon.h header:

79 /*
80 ** PR_ASSERT_CURRENT_THREAD_IN_MONITOR
81 ** If the current thread is in |mon|, this assertion is guaranteed to
82 ** succeed.  Otherwise, the behavior of this function is undefined.
83 */
84 #if defined(DEBUG) || defined(FORCE_PR_ASSERT)
85 #define PR_ASSERT_CURRENT_THREAD_IN_MONITOR(/* PRMonitor* */ mon) \
86     PR_AssertCurrentThreadInMonitor(mon)
87 #else
88 #define PR_ASSERT_CURRENT_THREAD_IN_MONITOR(/* PRMonitor* */ mon)
89 #endif
90 
91 /* Don't call this function directly. */
92 NSPR_API(void) PR_AssertCurrentThreadInMonitor(PRMonitor *mon);

I'll make the change you suggested.

@@ +672,5 @@
> +    if (timeout == PR_INTERVAL_NO_TIMEOUT)
> +        rv = pthread_cond_wait(&mon->waitCV, &mon->lock);
> +    else
> +        rv = pt_TimedWait(&mon->waitCV, &mon->lock, timeout);
> +    PR_ASSERT(0 == rv);

This is a good question. The PR_WaitCondVar and PR_Wait functions do not report the timed-out error. This is documented in https://developer.mozilla.org/en-US/docs/PR_WaitCondVar and https://developer.mozilla.org/en-US/docs/PR_Wait.

If pt_TimedWait times out, it still returns 0. This is done at the end of the pt_TimedWait function:

    rv = pthread_cond_timedwait(cv, ml, &tmo);

    /* NSPR doesn't report timeouts */
#ifdef _PR_DCETHREADS
    if (rv == -1) return (errno == EAGAIN) ? 0 : errno;
    else return rv;
#else
    return (rv == ETIMEDOUT) ? 0 : rv;
#endif

The original implementation of PR_Wait returns the return value of the PR_WaitCondVar(mon->cvar, timeout) call, which doesn't report the timed-out error either.

Wan-Teh Chang

Assignee

Comment 24

•

10 years ago

Attached patch PR_AssertCurrentThreadInMonitor should be a no-op when assertions are disabled. — Details — Splinter Review

Attachment #8367536 - Flags: review?(thuang)

Ting-Yuan Huang

Comment 25

•

10 years ago

Comment on attachment 8367536 [details] [diff] [review]
PR_AssertCurrentThreadInMonitor should be a no-op when assertions are disabled.

Review of attachment 8367536 [details] [diff] [review]:
-----------------------------------------------------------------

You are right. I should read the comments more carefully. Sorry about that.

Attachment #8367536 - Flags: review?(thuang) → review+

Ting-Yuan Huang

Updated

•

10 years ago

Attachment #8367092 - Flags: review- → review+

Wan-Teh Chang

Assignee

Comment 26

•

10 years ago

Comment on attachment 8367536 [details] [diff] [review]
PR_AssertCurrentThreadInMonitor should be a no-op when assertions are disabled.

No worries. Thanks again for the review. Patch checked in:
https://hg.mozilla.org/projects/nspr/rev/f2aa2bf5abbb

Attachment #8367536 - Flags: checked-in+

Wan-Teh Chang

Assignee

Comment 27

•

10 years ago

Attached patch Add a reference count to PRMonitor because PR_ExitMonitor may access struct members after unlocking the internal mutex (obsolete) — Details — Splinter Review

[Ting-Yuan: I am just using "superreview" as a second review.]

When I pushed my new PRMonitor code to mozilla-inbound, it caused
assertion failures on Mac OS X 10.6. (The assertion failures were
reported in bug 958796 comment 29 because I cited that bug in
the commit message.)

The assertion that failed is this one at the end of PR_ExitMonitor,
at ptsynch.c:637:

 600 PR_IMPLEMENT(PRStatus) PR_ExitMonitor(PRMonitor *mon)
 601 {
         ...
 620     mon->entryCount -= 1;  /* reduce by one */
 621     if (mon->entryCount == 0)
 622     {
 623         /* and if it transitioned to zero - notify an entry waiter */
 624         /* make the owner unknown */
 625         _PT_PTHREAD_INVALIDATE_THR_HANDLE(mon->owner);
 626         notifyEntryWaiter = PR_TRUE;
 627         notifyTimes = mon->notifyTimes;
 628         mon->notifyTimes = 0;
 629     }
 630     rv = pthread_mutex_unlock(&mon->lock);
 631     PR_ASSERT(0 == rv);
 632     if (notifyTimes)
 633         pt_PostNotifiesFromMonitor(&mon->waitCV, notifyTimes);
 634     if (notifyEntryWaiter)
 635     {
 636         rv = pthread_cond_signal(&mon->entryCV);
 637         PR_ASSERT(0 == rv);  <=== FAILED
 638     }
 639     return PR_SUCCESS;
 640 }  /* PR_ExitMonitor */

This reminded me of the need to reference-count PRCondVar because
of the delayed condition variable notifications. (The reference
count in PRCondVar is the 'notify_pending' member.) When I implemented
the same optimization for PRMonitor, I thought a reference count
was not necessary because the mutex and condition variables of a
PRMonitor are created and destroyed together. But I believe I was
wrong.

How to review this patch:

1. First, read the test program monref.c. I hope you will agree
that the test program destroys the PRMonitor at a safe time.

Note: By adding an artificial delay before the
pthread_cond_signal(&mon->entryCV) call on line 636 for the
'ThreadFunc' thread, I can make the test program hang on Linux,
and AddressSanitizer reports heap-use-after-free.

2. Then, review the changes to primpl.h and ptsynch.c that implement
the reference count of PRMonitor.

I know this problem is very subtle. Thank you for your patience in
advance.

Attachment #8371192 - Flags: superreview?(thuang)

Attachment #8371192 - Flags: review?(Pidgeot18)

Wan-Teh Chang

Assignee

Comment 28

•

10 years ago

Comment on attachment 8371192 [details] [diff] [review]
Add a reference count to PRMonitor because PR_ExitMonitor may access struct members after unlocking the internal mutex

Review of attachment 8371192 [details] [diff] [review]:
-----------------------------------------------------------------

::: pr/src/pthreads/ptsynch.c
@@ +539,2 @@
>  #endif
> +        PR_Free(mon);    

I will remove the whitespace.

@@ +640,4 @@
>      if (notifyEntryWaiter)
>      {
> +        if (notifyTimes)
> +            pt_PostNotifiesFromMonitor(&mon->waitCV, notifyTimes);

notifyTimes is nonzero only when notifyEntryWaiter is true, so
it is correct to move this inside the "if (notifyEntryWaiter)"
block. This makes it clear that the members of 'mon' are accessed
after unlocking mon->lock only if notifyEntryWaiter is true,
and therefore we only need to release a reference in the same
"if" block.

Ting-Yuan Huang

Comment 29

•

10 years ago

Comment on attachment 8371192 [details] [diff] [review]
Add a reference count to PRMonitor because PR_ExitMonitor may access struct members after unlocking the internal mutex

Review of attachment 8371192 [details] [diff] [review]:
-----------------------------------------------------------------

I can't find any problem in this patch. By the way, did you push to the try tree before inbound?

Attachment #8371192 - Flags: superreview?(thuang) → superreview+

Wan-Teh Chang

Assignee

Comment 30

•

10 years ago

Attached patch Add a reference count to PRMonitor because PR_ExitMonitor may access struct members after unlocking the internal mutex, v2 — Details — Splinter Review

No, I didn't push to the try server before pushing to mozilla-inbound.

Please review the new patch. I made the following changes.

1. I realized that PR_DestroyMonitor has become a "release reference"
function, so PR_ExitMonitor can just call PR_DestroyMonitor to release
a reference. With this change, PR_DestroyMonitor can compare refCount
with 0 using == 0 instead of <= 0.

2. I added some comments.

Attachment #8371192 - Attachment is obsolete: true

Attachment #8371192 - Flags: review?(Pidgeot18)

Attachment #8371758 - Flags: superreview?(thuang)

Attachment #8371758 - Flags: review?(Pidgeot18)

Wan-Teh Chang

Assignee

Comment 31

•

10 years ago

Attached patch [For reference only] Patch to add an artificial delay in PR_ExitMonitor for a non-primordial thread — Details — Splinter Review

I used this patch and the monref.c test program to reproduce the
heap-use-after-free bug in PR_ExitMonitor.

Note: the primordial thread (PT_THREAD_PRIMORD) is the thread
that runs the main() function, also called the main thread.

Wan-Teh Chang

Assignee

Updated

•

10 years ago

Attachment #8371798 - Attachment is patch: true

Wan-Teh Chang

Assignee

Comment 32

•

10 years ago

Comment on attachment 8371758 [details] [diff] [review]
Add a reference count to PRMonitor because PR_ExitMonitor may access struct members after unlocking the internal mutex, v2

Review of attachment 8371758 [details] [diff] [review]:
-----------------------------------------------------------------

Patch checked in: https://hg.mozilla.org/projects/nspr/rev/ef953c9ffbc4

The try server confirms this patch fixes the assertion failure.

I would still appreciate a post-commit review. Thanks.

Attachment #8371758 - Flags: checked-in+

Wan-Teh Chang

Assignee

Comment 33

•

10 years ago

Comment on attachment 8371758 [details] [diff] [review]
Add a reference count to PRMonitor because PR_ExitMonitor may access struct members after unlocking the internal mutex, v2

Review of attachment 8371758 [details] [diff] [review]:
-----------------------------------------------------------------

Here are the try results. Check the OS X 10.6 Debug results.

Before: https://tbpl.mozilla.org/?tree=Try&rev=767d9ffc5d6c
After:  https://tbpl.mozilla.org/?tree=Try&rev=708b36ddb7d2

Ting-Yuan Huang

Updated

•

10 years ago

Attachment #8371758 - Flags: superreview?(thuang) → superreview+

Wan-Teh Chang

Assignee

Comment 34

•

10 years ago

Comment on attachment 8371758 [details] [diff] [review]
Add a reference count to PRMonitor because PR_ExitMonitor may access struct members after unlocking the internal mutex, v2

Review of attachment 8371758 [details] [diff] [review]:
-----------------------------------------------------------------

Pushed to mozilla-inbound as part of NSPR_4_10_4_BETA3:
https://hg.mozilla.org/integration/mozilla-inbound/rev/386a2183cc84

Wan-Teh Chang

Assignee

Updated

•

10 years ago

Whiteboard: [tsan][tsan-test-blocker] → [tsan][tsan-test-blocker][leave open]

Phil Ringnalda (:philor)

Comment 35

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/386a2183cc84

Wan-Teh Chang

Assignee

Comment 36

•

10 years ago

Attached patch Port the new PRMonitor code to Windows (obsolete) — Details — Splinter Review

Please review the patch in three steps.

1. prulock.c, prucv.c, and primpl.h (except the changes to struct PRMonitor):
I added _PR_InitLock, _PR_FreeLock, _PR_InitCondVar, and _PR_FreeCondVar so
that PRLock and PRCondVar structures can be initialized and cleaned up
without memory allocation and deallocation. This allows PRMonitor to contain
PRLock and PRCondVar members directly.

I also cleaned up some function declarations in primpl.h.

2. ptsynch.c: I cleaned up some code when I ported the code to Windows
(or rather, non-pthreads platforms).

3. prmon.c and primpl.h (the struct PRMonitor changes): there are two
important differences:
- PRCondVar must be notified while holding PRLock. However, it still
  helps for PR_Notify and PR_NotifyAll to delay the notifications
  because we only need to notify mon->waitCV when we fully exit the
  monitor (i.e., when mon->entryCount becomes 0).
- The reference counting of PRMonitor is not necessary, even though a
  similar optimization of delayed condition variable notifications is
  done inside PR_Unlock. The difference is that this implementation
  of PR_Unlock does that without accessing members of the PRCondVar
  struct. If we ever change this implementation of PR_Unlock, we will
  need to re-evaluate if we need to add a reference to PRMonitor in
  PR_ExitMonitor.

Attachment #8372955 - Flags: superreview?(Pidgeot18)

Attachment #8372955 - Flags: review?(thuang)

Wan-Teh Chang

Assignee

Comment 37

•

10 years ago

Attached patch [Correct patch] Port the new PRMonitor code to non-pthreads platforms (e.g., Windows) (obsolete) — Details — Splinter Review

I attached the wrong patch. Sorry.

Attachment #8372955 - Attachment is obsolete: true

Attachment #8372955 - Flags: superreview?(Pidgeot18)

Attachment #8372955 - Flags: review?(thuang)

Attachment #8372956 - Flags: superreview?(Pidgeot18)

Attachment #8372956 - Flags: review?(thuang)

Joshua Cranmer [:jcranmer]

Updated

•

10 years ago

Attachment #8371758 - Flags: review?(Pidgeot18) → review+

Joshua Cranmer [:jcranmer]

Comment 38

•

10 years ago

Comment on attachment 8372956 [details] [diff] [review]
[Correct patch] Port the new PRMonitor code to non-pthreads platforms (e.g., Windows)

Review of attachment 8372956 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay, these are not easy patches to keep in the brain.

Attachment #8372956 - Flags: superreview?(Pidgeot18) → review+

Wan-Teh Chang

Assignee

Comment 39

•

10 years ago

Attached patch Port the new PRMonitor code to non-pthreads platforms (e.g., Windows), v2 — Details — Splinter Review

Joshua: thank you for the review. It is worth the wait.

Patch checked in, after fixing some coding style nits:
https://hg.mozilla.org/projects/nspr/rev/cd5086f15d8d

Attachment #8372956 - Attachment is obsolete: true

Attachment #8372956 - Flags: review?(thuang)

Attachment #8381858 - Flags: checked-in+

Wan-Teh Chang

Assignee

Comment 40

•

10 years ago

Attached patch Don't map pthread_mutex_init or pthread_cond_init error to PR_OPERATION_NOT_SUPPORTED_ERROR — Details — Splinter Review

PR_OPERATION_NOT_SUPPORTED_ERROR ("The requested operation
is not supported by the platform") seems like the wrong error
code for the failure of pthread_mutex_init or pthread_cond_init.

Since a nonzero return value of pthread_mutex_init and
pthread_cond_init is the Unix error code, we can just map it
to an NSPR error code.

Attachment #8381867 - Flags: review?(Pidgeot18)

Wan-Teh Chang

Assignee

Comment 41

•

10 years ago

Comment on attachment 8381858 [details] [diff] [review]
Port the new PRMonitor code to non-pthreads platforms (e.g., Windows), v2

Review of attachment 8381858 [details] [diff] [review]:
-----------------------------------------------------------------

Patch pushed to mozilla-inbound as part of the NSPR_4_10_4_BETA4 update:
https://hg.mozilla.org/integration/mozilla-inbound/rev/65f75ee1c34f

Wan-Teh Chang

Assignee

Updated

•

10 years ago

Whiteboard: [tsan][tsan-test-blocker][leave open] → [tsan][tsan-test-blocker]

Ed Morley [:emorley]

Comment 42

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/65f75ee1c34f

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Joshua Cranmer [:jcranmer]

Updated

•

10 years ago

Attachment #8381867 - Flags: review?(Pidgeot18) → review+

Wan-Teh Chang

Assignee

Comment 43

•

10 years ago

Comment on attachment 8381867 [details] [diff] [review]
Don't map pthread_mutex_init or pthread_cond_init error to PR_OPERATION_NOT_SUPPORTED_ERROR

Patch checked in (for NSPR 4.10.5):
https://hg.mozilla.org/projects/nspr/rev/b033da9a671e

Attachment #8381867 - Flags: checked-in+

You need to log in before you can comment on or make changes to this bug.