Open Bug 551155 Opened 13 years ago Updated 5 months ago

mark up Fx codebase with race-detector annotations


(Core :: General, defect)





(Reporter: jseward, Assigned: jseward)


(Depends on 3 open bugs)


(Keywords: sec-want, Whiteboard: [sg:want])


(4 files, 9 obsolete files)

2.68 KB, text/plain
44.12 KB, patch
Details | Diff | Splinter Review
41.70 KB, patch
Details | Diff | Splinter Review
6.88 KB, patch
: feedback+
Details | Diff | Splinter Review
It would be nice to be able to use Valgrind/Helgrind and possibly
other run-time race detectors to routinely find and remove data
races in Fx, in the same way that Valgrind/Memcheck and Purify are
currently used to find and remove memory errors.  The ultimate aim
is to ship a codebase with no detectable threading errors, in the
style of bug 549224.

For this to be viable requires race detectors to be able to "see"
all inter-thread synchronisation events.  Most such tools are limited
to intercepting pthread_* calls, and require markup of the to-be-tested
application do deal with any other custom lock primitives, etc.

Annotations need to be added in two sets of places:

  around implementations of custom threading primitives
  (eg, JSThinLock)

  at allocation points (constructors) for memory areas which are
  known to be harmlessly raced on (eg, JSRuntime::gcPoke).  These
  annotations tell race detectors to ignore (don't track) such
  memory areas, at least until they are deallocated.

bug 547736 makes a start, with annotations for Valgrind/Helgrind
for jsengine.
Depends on: 547736
Assignee: nobody → jseward
I ran the simplest full-system test I could think of, which is to
start Fx, wait for a blank page, then do File->Quit.  What follows is
a rough partial triage and commentary of findings.

This is: m-c, 11 Mar 2010, x86_64-Linux, -g -O2, release build, 
running on valgrind --tool=helgrind, trunk svn r11089 or later.
DISPLAY is a VNC X server, 960x720x16, one xterm, no window manager.

I started with the SpiderMonkey markup patch at bug 547736 comment 15.
Even with that patch, hundreds of races are reported.  As per this
triage I extended it by adding annotations to mark & hide
obviously-harmless races.  Result is that Fx startup/exit now shows
only 21 races or other errors (lock ordering problems).

Patch (very much WIP) is attached.  Also a file showing the remaining
21 complaints.

--- Not-obviously-harmless races ------------------------------

jsengine gc
   as per bug 551739, JSContext::requestDepth is raced on
   is probably harmless as per comments by Brendan & Jason
   (but rules about why this is OK are complex)

xpcom/threads/TimerThread.cpp:262 vs xpcom/threads/nsTimerImpl.cpp:572
   nsTimerImpl::mTimeout is accessed while mLock is held at
   TimerThread.cpp:262, but mLock is not held during access at
   nsTimerImpl.cpp:572.  Intentional?  I don't know.

   262: if (!TIMER_LESS_THAN(now, timer->mTimeout + mTimeoutAdjustment)) {

   572: mTimeout = now;

--- Harmless races --------------------------------------------

   PRThread::id is raced on (line 185 vs 521)
   Harmless and commented-on in the source
   ANNOTATED so we won't see this again

   PRMonitor::owner is raced on, at least in PR_EnterMonitor
      if (!pthread_equal(mon->owner, self))
      _PT_PTHREAD_COPY_THR_HANDLE(self, mon->owner);
   Harmless, I'm pretty sure.
   ANNOTATED so we won't see this again

xpcom/glue/nsISupportsImpl.h (deep in the innards of ref-counting)
   NS_IMETHODIMP_(nsrefcnt) _class::Release(void)
   which implements threadsafe 'Release' (drop-a-reference) for
   whatever class, writes to mRefCnt without a lock once it has
   become zero.  This is harmless because no other thread can
   access it once the refcount is zero.
   ANNOTATED so we won't see this again

   Block::freeNode is accessed MT without a lock.  As per comments
   in the source, this is harmless.
   ANNOTATED so we won't see this again

   nsRecyclingAllocator::Malloc(PRSize bytes, PRBool zeroit)
   hands out chunks of memory which previously "belonged" to
   other threads.  We need to tell race checkers that the memory
   now belongs to the new owner.
   ANNOTATED accordingly

jsengine structures
   as per bug 547736, the following fields have harmless races
   JSGCStats::arenaStats (debug builds only)
   The patch incorporates annotations from 547736 to handle them.
Contains a couple of things that should be split out into separate bugs:

* nsprpub/pr/src/md/unix/os_Linux_x86_64.s: add .type/.size pseudo ops
  so the functions names defined in this file appear in stack traces

* js_InitTitle(JSContext *cx, JSTitle *title): actually call 
  js_InitLock on title->lock; don't just memset-0 it
Attachment #432142 - Attachment is obsolete: true
More races triaged:

* XPCPerThreadData::~XPCPerThreadData, bug 557586.  Due to the
  particular order of thread exiting, is harmless.

* nsTArray_base::IncrementLength(PRUint32 n) will non-atomically
  add zero to the mLength field of nsTArray_base::sEmptyHdr, which
  is a special header for zero-length arrays, in the case where
  a zero-length array is appended to an existing zero-length array.
  This gives a R-M-W race against reads from other threads of
  nsTArray_base::sEmptyHdr.mLength (presumably for arrays which
  are also empty).

  So is harmless.  One fix is to change IncrementLength thusly:

    void IncrementLength(PRUint32 n) {
      NS_ASSERTION(mHdr != &sEmptyHdr || n == 0, "bad data pointer");
      if (n > 0)
          mHdr->mLength += n;

  or can just annotate to say, don't track sEmptyHdr.mLength.

  (w/ thanks to cjones and timeless_mbp for helping me make sense
  of this)
Possible vtable-overwrite races in xpcom

Pretty mind-bending stuff.  I don't know yet if these are harmful or
not, or even if they are real -- perhaps some synchronisation event is
being missed by the checker.

Imagine we have a base class A with a virtual destructor, and class B
that inherits from it:

  class A {
     virtual ~A() { ... }

  class B : public A {
     virtual ~B() { ... }

When an object of (dynamic) type B is destroyed, we enter ~B.  This
does any B-specific finalisation, and then calls onwards to ~A.
However, in order that ~A runs with the correct vtable, ~B must first
overwrite the first word of the object -- the vtable pointer -- with
the address of the vtable for A.

If a second thread is meanwhile doing virtual function calls on the
object, then it will read the vtable pointer and so there is a race.

You might well ask, in that case how can the second thread avoid
segfaulting in the case where ~A deallocates the memory holding the
object and then it is munmapped?  Well, by causing ~A to wait for a
signal of some kind from the second thread before deallocating (viz,
returning).  Such code is still thread-unsafe, though, because the
second thread will have done virtual function calls on an object
whose vtable pointer is being changed by the first thread.

I have observed various races in xpcom along these lines.  They are
also exhibited more simply by xpcom's unit test,
xpcom/tests/TestThreads.cpp.  They are all characterised by:

* one thread does a write, the other a read

* the writing thread is claimed to be at the opening brace of a
  virtual destructor for some class C

* the raced-on address is the first word of an object of class C,
  being destroyed by the writing thread

* the old value (that is just about to be overwritten) is a pointer to
  C's vtable


Possible data race during write of size 8 at 0xc6c31b0 by thread #1
   at 0x52CC66B: nsThreadStartupEvent::~nsThreadStartupEvent() 
   by 0x528E696: nsRunnable::Release() (nsThreadUtils.cpp:55)
   by 0x52CBB4F: nsThread::Init() (nsThread.cpp:342)
   by 0x52CCA61: nsThreadManager::NewThread(unsigned int, nsIThread**) 
   by 0x528E4FE: NS_NewThread_P(nsIThread**, nsIRunnable*) 
   by 0x400C62: TestThreads() (TestThreads.cpp:14760)
   by 0x400CEC: main (TestThreads.cpp:14787)

 This conflicts with a previous read of size 8 by thread #2
   at 0x528831B: nsCOMPtr_base::assign_assuming_AddRef(nsISupports*) 
   by 0x52CC5D8: nsCOMPtr<nsIRunnable>::operator=(nsIRunnable*) (nsCOMPtr.h:640)
   by 0x52CBC45: nsThread::ThreadFunc(void*) (nsThread.cpp:250)
   by 0x5B56FC2: _pt_root (ptthread.c:230)
   by 0x4C2A5D8: mythread_wrapper (hg_intercepts.c:213)
   by 0x4E34A03: start_thread (pthread_create.c:300)
   by 0x67F880C: clone (clone.S:112)

 Address 0xc6c31b0 is 0 bytes inside a block of size 32 alloc'd
   at 0x4C25B08: malloc (vg_replace_malloc.c:236)
   by 0x5524E8B: moz_xmalloc (mozalloc.cpp:75)
   by 0x52CC5FA: nsThreadStartupEvent::Create() (mozalloc.h:222)
   by 0x52CBAAF: nsThread::Init() (nsThread.cpp:315)
   by 0x52CCA61: nsThreadManager::NewThread(unsigned int, nsIThread**) 
   by 0x528E4FE: NS_NewThread_P(nsIThread**, nsIRunnable*) 
   by 0x400C62: TestThreads() (TestThreads.cpp:14760)
   by 0x400CEC: main (TestThreads.cpp:14787)

 The old value at address 0xc6c31b0 was 0x551de70
   and 0x551de70 is symbol "vtable for nsThreadStartupEvent+16"
(In reply to comment #6)
> Possible vtable-overwrite races in xpcom

These were analysed and found to be false positives in bug 558402, now
closed-as-invalid.  Issue is Helgrind does not understand that the
last Release-r of an object becomes its exclusive owner and may
arbitrarily mess with it without any locking during its destruction.
This is trivially fixed by annotating NS_IMPL_THREADSAFE_RELEASE.
Update of current status:

About 2/3 of the races reported for "start browser to blank page; wait
for a while; quit" have been triaged.  Outcome is: two false positives
to do with reference counting, and a number of races that are
harmless.  It's not clear that all of the latter were known about or
intended by the code's authors.

Currently there remain untriaged race reports in the following


  nsTimerImpl::SetDelayInternal(unsigned int)






There's also a possible lock order problem which looks like it might
have to do with /usr/lib/
(In reply to comment #8)
>   nsDiskCacheOutputStream::Close()

oooh, please look into that one first.
> (In reply to comment #8)
> >   nsDiskCacheOutputStream::Close()
> oooh, please look into that one first.

The race isn't always reported, only on about 50% of startup-then-idle

Relevant source file is netwerk/cache/src/nsDiskCacheStreams.cpp.

It appears nsDiskCacheOutputStream::mClosed is accessed by multiple
threads without locking.  I'm seeing nsDiskCacheOutputStream::Close
racing against nsDiskCacheOutputStream::Write and also (I suspect)
against itself:

233 nsDiskCacheOutputStream::Close()
234 {
235     if (!mClosed) {
236         mClosed = PR_TRUE;
237         // tell parent streamIO we are closing
238         mStreamIO->CloseOutputStream(this);
239     }
240     return NS_OK;
241 }

254 nsDiskCacheOutputStream::Write(const char *buf, PRUint32 count, PRUint32 *bytesWritten)
255 {
256     if (mClosed)  return NS_BASE_STREAM_CLOSED;
257     return mStreamIO->Write(buf, count, bytesWritten);
258 }

* Write at line 236 races against read at line 256
  (::Write vs ::Close)

* Write at line 236 races against read at line 235
  (::Close vs ::Close)

* Write at line 236 races against write at line 236
  (::Close vs ::Close)

So imagine 2 threads enter ::Close in lockstep, both find mClosed ==
PR_FALSE, so mStreamIO->CloseOutputStream(this) is done twice (and
without locking) for the same stream.  Is that ok?

Also (probably more serious) imagine two threads that want
to do ::Write and then ::Close.  Initially mClose == PR_FALSE.
T1 enters ::Write, does its write, then enters ::Close,
does ::mClosed = PR_TRUE, then stalls.  T2 enters ::Write,
sees mClosed == PR_TRUE, returns, hence its data is never
Hmm, possibly is more of the same thread-safe-reference-counting
confuses Helgrind problem.  Needs further analysis.
Making sense of race reports is way more difficult than making sense
of memory error reports.  I'll need help from the module owners.
Am adopting the strategy of filing what I have as separate bugs,
marking them UNCONFIRMED for the time being.  If they turn out to
be bogus or harmless they can be closed-as-invalid-ed.

AFAICS triaging a race report can have 3 outcomes:

(1) it's a real race, which is harmful.  This is the only "useful"

(2) it's a real race, but is not dangerous.  Options are to fix it
    anyway, or add an annotation to the source directing Helgrind to
    ignore it.  At a minimum we need to ensure the source contains a
    comment making it clear that the race is known to exist, and
    (preferably) come up with a rigorous argument why the race is

(3) it's not a race.  This happens when Helgrind fails to see some
    inter-thread synchronisation event.  The "fix" is again to annotate
    the Fx source to describe what is going on.

A good example of (3) are Release methods as described in comment 7.
Hi Julian, are you still working on this or has this moved to the far back burner?
Working actively on this -- see also bug 547736 comment 17.  This
week I've fixed enough scaling problems in Helgrind to make it
feasible to run substantial chunks of mochitests in one go.  Now
looking at redoing this patch, including redoing annotations of thread
safe ref counting (nsCOMPtr et al), which I realise now I did 
incorrectly in previous attempts.
WIP patch, suitable for Helgrinding SpiderMonkey.  Also greatly
reduces the noise level from a browser startup/quit.

General idea is for the patch to contain a mix of

- annotations which are essential to describe to Helgrind, sync
  operations which it doesn't otherwise understand (thread safe
  refcounting, mostly)

- annotations to hide detected races.

Idea is for the patch to serve as a record of the detected races.
Races that are considered harmful should get filed as bugs and
eventually fixed, in which case their hiding-annotations can be
removed from future versions of the patch.

All annotations are marked with a HGANN comment line, to make them
easily greppable.  Currently there are 26 annotations, of which
6 describe thread-safe refcounting, 7 describe other sync events,
11 hide known-harmless races, and 2 hide non-harmless races.
The complete list is below.  n-i-bz means not-in-bugzilla.

// HGANN 001 BEGIN 551155 HIDES_HARMLESS sEmptyHdr::mLength += 0
// HGANN 001 END

/* HGANN 002 BEGIN REQD_ANN thread-safe-ref-counting */                     \
/* HGANN 002 END */                                                         \

/* HGANN 003 BEGIN REQD_ANN thread-safe-ref-counting */               \
/* HGANN 003 END */                                                   \

// HGANN 004 BEGIN REQD_ANN thread-safe-ref-counting
// HGANN 004 END

/* HGANN 005 BEGIN REQD_ANN thread-safe-ref-counting */
/* HGANN 005 END */

// HGANN 006 BEGIN n-i-bz HIDES_HARMLESS on JSRuntime::various_fields
// HGANN 006 END

// HGANN 007 BEGIN n-i-bz HIDES_HARMLESS on JSThreadData::requestDepth
// HGANN 007 END

// HGANN 008 BEGIN n-i-bz HIDES_HARMLESS on JSContext::outstandingRequests
// HGANN 008 END

// HGANN 009 BEGIN n-i-bz HIDES_HARMLESS on LocalTZA (global var)
// HGANN 009 END

// HGANN 010 BEGIN REQD_ANN js_InitLock
// HGANN 010 END

// HGANN 011 BEGIN REQD_ANN js_FinishLock
// HGANN 011 END

// HGANN 012 END

// HGANN 013 BEGIN REQD_ANN ThinUnlock
// HGANN 013 END

// HGANN 014 BEGIN n-i-bz HIDES_HARMLESS on Assembler::nHints (global)
// HGANN 014 END

// HGANN 015 BEGIN n-i-bz HIDES_HARMLESS on Assembler::nHints (global)
// HGANN 015 END

// HGANN 016 BEGIN 559927 REQD_FIX unclear-if-real-bug-nor-if-correct-fix
// HGANN 016 END

// HGANN 017 BEGIN n-i-bz HIDES_HARMLESS on nsSocketTransport::mResolving
// HGANN 017 END

// HGANN 018 BEGIN REQD_ANN thread-safe-ref-counting
// HGANN 018 END

// HGANN 019 BEGIN 560579 REQD_FIX avoid bitfield race in nsHttpTransaction
// HGANN 019 END

// HGANN 020 END

// HGANN 021 END

// HGANN 022 BEGIN n-i-bz HIDES_HARMLESS on PRMonitor::owner (RECHECK)
// HGANN 022 END

// HGANN 023 BEGIN n-i-bz HIDES_HARMLESS on PRThread::id (RECHECK)
// HGANN 023 END

// HGANN 024 BEGIN n-i-bz HIDES_HARMLESS on nsRecyclingAllocator::mFreeList
// HGANN 024 END

// HGANN 025 BEGIN REQD_ANN nsRecyclingAllocator::Malloc
// HGANN 025 END

/* HGANN 026 BEGIN REQD_ANN thread-safe-ref-counting */
/* HGANN 026 END */
Attachment #437280 - Attachment is obsolete: true
Above results achieveable with Valgrind-3.6.1 or trunk.  Trunk is
much preferable due to significant scalability improvements in
Helgrind.  In either case the as-yet-uncommitted patch below is
needed for correct operation.

Index: helgrind/hg_main.c
--- helgrind/hg_main.c	(revision 11576)
+++ helgrind/hg_main.c	(working copy)
@@ -3207,7 +3207,8 @@
    so = map_usertag_to_SO_lookup_or_alloc( usertag );
-   libhb_so_send( thr->hbthr, so, True/*strong_send*/ );
+   //libhb_so_send( thr->hbthr, so, True/*strong_send*/ );
+   libhb_so_send( thr->hbthr, so, False/*!strong_send*/ );
Adds description and hiding-annotations of 6 further races shown below.
Together with a small suppressions file to hide complainage to do with
Gnome libraries and to do with glibc's DNS lookups, this allows a clean
(no-reported-races) startup to blank page, then exit, of the browser.

Note this doesn't fix any of the races.  It merely continues the process
of cataloguing them.

// HGANN 027 BEGIN n-i-bz HIDES_UNASSESSED on ::sGlobalObserver (global var)
// HGANN 027 END

// HGANN 028 BEGIN n-i-bz HIDES_UNASSESSED on ns[..]::mInitialized (d-c locking)
// HGANN 028 END

// HGANN 029 BEGIN n-i-bz HIDES_UNASSESSED on Canary::sOutputFD (global var)
// HGANN 029 END

// HGANN 030 BEGIN 637461 HIDES_UNASSESSED on StartupCache::mWriteThread
// HGANN 030 END

// HGANN 031 BEGIN 559793 HIDES_UNASSESSED on TimerThread::mTimeoutAdjustment
// HGANN 031 END

// HGANN 032 BEGIN 559793 HIDES_UNASSESSED on nsTimerImpl::mArmed
// HGANN 032 END
Attachment #516080 - Attachment is obsolete: true
For use w/ valgrind trunk on x64-linux (Ubuntu 10.04.2)
Attachment #432143 - Attachment is obsolete: true
Attachment #516246 - Attachment is obsolete: true
Attachment #516248 - Attachment is obsolete: true
Attachment #521198 - Attachment is obsolete: true
What's required to get this into the tree? It would be very nice to have this also from a security standpoint.
Whiteboard: [sg:want]
Keywords: sec-want
Attachment #524238 - Attachment is obsolete: true
While I am investigating strange usage of uninitialized value
(Bug 803816 - Valgrind warnings about uninitialised memory use (Thunderbird)), I tried to run thunderbird under helgrind.

After trying to apply the patch automatically and saw there were many FAILED hunks, I re-created the patch against ./mozilla subdirectory of comm-central.

The version of mozilla I used is (in comm-central/mozilla subdirectory)

    $ hg identify
    a517f7ea5bef+ tip

I enclosed the changes into


appropriately as much as possible.

In order to enable the if-defed parts, when I ran configure and make I added -DUSHELGRIND=1 to CC and CXX macros manually. (Someone might be able to figure out how to add these to compiler flags automagically).

E.g., like this in a shell script:

export CC="/usr/bin/gcc -DDEBUG_4GB_CHECK -DUSEHELGRIND=1"
export CXX="/usr/bin/g++ -DDEBUG_4GB_CHECK -DUSEHELGRIND=1"
time make -f \
    CC="/usr/bin/gcc -DDEBUG_4GB_CHECK -DUSEHELGRIND=1" \
    CXX="/usr/bin/g++ -DDEBUG_4GB_CHECK -DUSEHELGRIND=1" \

For creating the patch, I found a few issues:

However, there is one header file which is not amenable to a small patch using #if/#else/#endif like this. Also, there was an assembler source file which I left as is.

For one file, already a patch has been incorporated into the repository to include the suggested mlock and so I did not bother to do anything for this file.

>patching file netwerk/base/src/nsAsyncStreamCopier.cpp
>Hunk #1 FAILED at 272.
>1 out of 1 hunk FAILED -- saving rejects to file netwerk/base/src/nsAsyncStreamCopier.cpp.rej

The following file may be a problem.  I found that no code remains at all in the header file that looks like the ones that was patched. (???) So this change is not in the re-created patch at all. I have no idea what is going on here.

>patching file netwerk/dns/nsHostResolver.h
>Hunk #1 FAILED at 62.
>1 out of 1 hunk FAILED -- saving rejects to file netwerk/dns/nsHostResolver.h.rej

>Assembly Source for 64 bits:?
>patching file nsprpub/pr/src/md/unix/os_Linux_x86_64.s
>Hunk #1 succeeded at 26 (offset -32 lines).

>Use of IF/ENDIF may not be possible in macro definition.
>patching file xpcom/glue/nsISupportsImpl.h

Hope this may help the inclusion of the code into the main repository. Someone ought to check the operation of thunderbird and firefox occasionally under valgrind (memgrind, helgrind, etc.)
(I am uploading the patch again as "patch" so that "diff" display feature can be used.)
I am obsoleting the previous upload.
Attachment #678282 - Attachment is obsolete: true
I think that the portion of the patch to xpcom/base/nsAtomicRefcnt.h below
got the type wrong: i.e., nsrefcnt needs to be replaced with int32_t in two places. I am running with such modification for now.

-inline int32_t
+inline nsrefcnt
+__attribute__((noinline)) static nsrefcnt
 NS_AtomicDecrementRefcnt(int32_t &refcnt)
   return PR_ATOMIC_DECREMENT(&refcnt);
This is a minimal starting-off patch, for consideration.  It is the
first of a series of patches which makes Gecko race-detector friendly.

The patch describes the synchronisation effects caused by threadsafe
ref counting, as this is not something tools can themselves infer.  In
effect it says that the thread that causes a 1->0 refcount transition
acquires exclusive ownership of the object, so that subsequent
accesses to it (to destruct it) can legitimately be done without

This is for the classes: base::subtle::RefCountedBase,
nsHttpTransaction and (most importantly) nsISupports.

The patch also tightens up the relevant feature test
(--enable-valgrind) to ensure all the required headers are present.

The annotations are commented in a particular way, that makes it
possible to get a summary of all annotations in the code base by
grepping for the string HGANN.  This is important because there will
eventually be about 30 ish annotations needed, most of which hide
known races.
Attachment #699268 - Flags: feedback?(benjamin)
Comment on attachment 699268 [details] [diff] [review]
A minimalist starting-off patch, for consideration

This is fine by me, but I think cjones is probably the best person to provide feedback, since he's my expert on concurrent programming.
Attachment #699268 - Flags: feedback?(benjamin) → feedback?(jones.chris.g)
Comment on attachment 699268 [details] [diff] [review]
A minimalist starting-off patch, for consideration

The approach looks fine.

Do helgrind/VEX have any provisions to run the HB/HA client requests atomically with the atomic memory mutation?
Attachment #699268 - Flags: feedback?(jones.chris.g) → feedback+
(In reply to Chris Jones [:cjones] [:warhammer] from comment #29)
> Do helgrind/VEX have any provisions to run the HB/HA client requests
> atomically with the atomic memory mutation?

No, it doesn't.  In this case (to do with threadsafe refcounting) I don't
think it matters much if we say that the HB event happens slightly before
the atomic op and the HA happens after, so long as it doesn't report a
race on the atomically modified location (mRefCnt) itself, which it
indeed doesn't.
Julian, let's get the minimal patch in. Is cjones' f+ enough here?
XPCOM refcounting changes pushed to bug 848305.
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.