Last Comment Bug 551155 - mark up Fx codebase with race-detector annotations
: mark up Fx codebase with race-detector annotations
Status: NEW
[sg:want]
: sec-want
Product: Core
Classification: Components
Component: General (show other bugs)
: Trunk
: x86 Linux
: -- normal (vote)
: ---
Assigned To: Julian Seward [:jseward]
:
Mentors:
: 547736 (view as bug list)
Depends on: 847764 848303 848305 547736 848312
Blocks:
  Show dependency treegraph
 
Reported: 2010-03-09 05:39 PST by Julian Seward [:jseward]
Modified: 2014-06-13 23:52 PDT (History)
19 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
patch as referred to in comment 1 (WIP) (18.67 KB, patch)
2010-03-12 06:52 PST, Julian Seward [:jseward]
no flags Details | Diff | Review
list of remaining errors as referred to in comment 1 (57.19 KB, text/plain)
2010-03-12 06:54 PST, Julian Seward [:jseward]
no flags Details
WIP rebased to m-c 40484:523d0f8e0926 (18.54 KB, patch)
2010-04-06 06:41 PDT, Julian Seward [:jseward]
no flags Details | Diff | Review
WIP patch against M-C of 24 Feb 2011 (37.06 KB, patch)
2011-03-01 16:40 PST, Julian Seward [:jseward]
no flags Details | Diff | Review
WIP patch against M-C of 24 Feb 2011 (revised) (40.95 KB, patch)
2011-03-02 05:53 PST, Julian Seward [:jseward]
no flags Details | Diff | Review
suppressions file referred to in comment 17 (1.82 KB, application/octet-stream)
2011-03-02 05:59 PST, Julian Seward [:jseward]
no flags Details
WIP patch against M-C of 23 Mar 2011 (45.22 KB, patch)
2011-03-23 09:25 PDT, Julian Seward [:jseward]
no flags Details | Diff | Review
Suppressions to go with markup in comment #19 (for Ubuntu 10.04 64 bit) (2.68 KB, text/plain)
2011-03-23 09:36 PDT, Julian Seward [:jseward]
no flags Details
WIP patch against M-C of 6 Apr 2011 (43.52 KB, patch)
2011-04-06 12:46 PDT, Julian Seward [:jseward]
no flags Details | Diff | Review
WIP patch against M-C of 23 Apr 2012 (44.12 KB, patch)
2012-05-08 14:34 PDT, Julian Seward [:jseward]
no flags Details | Diff | Review
Re-created patch against later C-C mozilla subdirectory (with USEHELGRIND macro) (41.70 KB, text/plain)
2012-11-05 05:25 PST, ISHIKAWA, Chiaki
no flags Details
Re-created patch against later C-C mozilla subdirectory (with USEHELGRIND macro) (41.70 KB, patch)
2012-11-05 05:28 PST, ISHIKAWA, Chiaki
no flags Details | Diff | Review
A minimalist starting-off patch, for consideration (6.88 KB, patch)
2013-01-08 09:00 PST, Julian Seward [:jseward]
cjones.bugs: feedback+
Details | Diff | Review

Description Julian Seward [:jseward] 2010-03-09 05:39:06 PST
It would be nice to be able to use Valgrind/Helgrind and possibly
other run-time race detectors to routinely find and remove data
races in Fx, in the same way that Valgrind/Memcheck and Purify are
currently used to find and remove memory errors.  The ultimate aim
is to ship a codebase with no detectable threading errors, in the
style of bug 549224.

For this to be viable requires race detectors to be able to "see"
all inter-thread synchronisation events.  Most such tools are limited
to intercepting pthread_* calls, and require markup of the to-be-tested
application do deal with any other custom lock primitives, etc.

Annotations need to be added in two sets of places:

  around implementations of custom threading primitives
  (eg, JSThinLock)

  at allocation points (constructors) for memory areas which are
  known to be harmlessly raced on (eg, JSRuntime::gcPoke).  These
  annotations tell race detectors to ignore (don't track) such
  memory areas, at least until they are deallocated.

bug 547736 makes a start, with annotations for Valgrind/Helgrind
for jsengine.
Comment 1 Julian Seward [:jseward] 2010-03-12 06:45:46 PST
I ran the simplest full-system test I could think of, which is to
start Fx, wait for a blank page, then do File->Quit.  What follows is
a rough partial triage and commentary of findings.

This is: m-c, 11 Mar 2010, x86_64-Linux, -g -O2, release build, 
running on valgrind --tool=helgrind, trunk svn r11089 or later.
DISPLAY is a VNC X server, 960x720x16, one xterm, no window manager.

I started with the SpiderMonkey markup patch at bug 547736 comment 15.
Even with that patch, hundreds of races are reported.  As per this
triage I extended it by adding annotations to mark & hide
obviously-harmless races.  Result is that Fx startup/exit now shows
only 21 races or other errors (lock ordering problems).

Patch (very much WIP) is attached.  Also a file showing the remaining
21 complaints.


--- Not-obviously-harmless races ------------------------------

jsengine gc
   as per bug 551739, JSContext::requestDepth is raced on
   is probably harmless as per comments by Brendan & Jason
   (but rules about why this is OK are complex)


xpcom/threads/TimerThread.cpp:262 vs xpcom/threads/nsTimerImpl.cpp:572
   nsTimerImpl::mTimeout is accessed while mLock is held at
   TimerThread.cpp:262, but mLock is not held during access at
   nsTimerImpl.cpp:572.  Intentional?  I don't know.

   262: if (!TIMER_LESS_THAN(now, timer->mTimeout + mTimeoutAdjustment)) {

   572: mTimeout = now;


--- Harmless races --------------------------------------------

nsprpub/pr/src/pthreads/ptthread.c:
   PRThread::id is raced on (line 185 vs 521)
   Harmless and commented-on in the source
   ANNOTATED so we won't see this again


nsprpub/pr/src/pthreads/ptsynch.c:
   PRMonitor::owner is raced on, at least in PR_EnterMonitor
      if (!pthread_equal(mon->owner, self))
   vs
      _PT_PTHREAD_COPY_THR_HANDLE(self, mon->owner);
   Harmless, I'm pretty sure.
   ANNOTATED so we won't see this again


xpcom/glue/nsISupportsImpl.h (deep in the innards of ref-counting)
   NS_IMETHODIMP_(nsrefcnt) _class::Release(void)
   which implements threadsafe 'Release' (drop-a-reference) for
   whatever class, writes to mRefCnt without a lock once it has
   become zero.  This is harmless because no other thread can
   access it once the refcount is zero.
   ANNOTATED so we won't see this again


xpcom/ds/nsRecyclingAllocator.cpp
   Block::freeNode is accessed MT without a lock.  As per comments
   in the source, this is harmless.
   ANNOTATED so we won't see this again


xpcom/ds/nsRecyclingAllocator.cpp
   nsRecyclingAllocator::Malloc(PRSize bytes, PRBool zeroit)
   hands out chunks of memory which previously "belonged" to
   other threads.  We need to tell race checkers that the memory
   now belongs to the new owner.
   ANNOTATED accordingly


jsengine structures
   as per bug 547736, the following fields have harmless races
   JSRuntime::gcPoke
   JSRuntime::protoHazardShape
   JSContext::operationCallbackFlag
   JSGCStats::arenaStats (debug builds only)
   The patch incorporates annotations from 547736 to handle them.
Comment 2 Julian Seward [:jseward] 2010-03-12 06:52:17 PST
Created attachment 432142 [details] [diff] [review]
patch as referred to in comment 1 (WIP)

Contains a couple of things that should be split out into separate bugs:

* nsprpub/pr/src/md/unix/os_Linux_x86_64.s: add .type/.size pseudo ops
  so the functions names defined in this file appear in stack traces

* js_InitTitle(JSContext *cx, JSTitle *title): actually call 
  js_InitLock on title->lock; don't just memset-0 it
Comment 3 Julian Seward [:jseward] 2010-03-12 06:54:24 PST
Created attachment 432143 [details]
list of remaining errors as referred to in comment 1
Comment 4 Julian Seward [:jseward] 2010-04-06 06:41:46 PDT
Created attachment 437280 [details] [diff] [review]
WIP rebased to m-c 40484:523d0f8e0926
Comment 5 Julian Seward [:jseward] 2010-04-07 15:16:10 PDT
More races triaged:

* XPCPerThreadData::~XPCPerThreadData, bug 557586.  Due to the
  particular order of thread exiting, is harmless.

* nsTArray_base::IncrementLength(PRUint32 n) will non-atomically
  add zero to the mLength field of nsTArray_base::sEmptyHdr, which
  is a special header for zero-length arrays, in the case where
  a zero-length array is appended to an existing zero-length array.
  This gives a R-M-W race against reads from other threads of
  nsTArray_base::sEmptyHdr.mLength (presumably for arrays which
  are also empty).

  So is harmless.  One fix is to change IncrementLength thusly:

    void IncrementLength(PRUint32 n) {
      NS_ASSERTION(mHdr != &sEmptyHdr || n == 0, "bad data pointer");
      if (n > 0)
          mHdr->mLength += n;
    }

  or can just annotate to say, don't track sEmptyHdr.mLength.

  (w/ thanks to cjones and timeless_mbp for helping me make sense
  of this)
Comment 6 Julian Seward [:jseward] 2010-04-09 05:59:22 PDT
Possible vtable-overwrite races in xpcom

Pretty mind-bending stuff.  I don't know yet if these are harmful or
not, or even if they are real -- perhaps some synchronisation event is
being missed by the checker.

Imagine we have a base class A with a virtual destructor, and class B
that inherits from it:

  class A {
     ...
     virtual ~A() { ... }
  }

  class B : public A {
     ... 
     virtual ~B() { ... }
  }

When an object of (dynamic) type B is destroyed, we enter ~B.  This
does any B-specific finalisation, and then calls onwards to ~A.
However, in order that ~A runs with the correct vtable, ~B must first
overwrite the first word of the object -- the vtable pointer -- with
the address of the vtable for A.

If a second thread is meanwhile doing virtual function calls on the
object, then it will read the vtable pointer and so there is a race.

You might well ask, in that case how can the second thread avoid
segfaulting in the case where ~A deallocates the memory holding the
object and then it is munmapped?  Well, by causing ~A to wait for a
signal of some kind from the second thread before deallocating (viz,
returning).  Such code is still thread-unsafe, though, because the
second thread will have done virtual function calls on an object
whose vtable pointer is being changed by the first thread.

I have observed various races in xpcom along these lines.  They are
also exhibited more simply by xpcom's unit test,
xpcom/tests/TestThreads.cpp.  They are all characterised by:

* one thread does a write, the other a read

* the writing thread is claimed to be at the opening brace of a
  virtual destructor for some class C

* the raced-on address is the first word of an object of class C,
  being destroyed by the writing thread

* the old value (that is just about to be overwritten) is a pointer to
  C's vtable

eg:


Possible data race during write of size 8 at 0xc6c31b0 by thread #1
   at 0x52CC66B: nsThreadStartupEvent::~nsThreadStartupEvent() 
                 (nsThread.cpp:173)
   by 0x528E696: nsRunnable::Release() (nsThreadUtils.cpp:55)
   by 0x52CBB4F: nsThread::Init() (nsThread.cpp:342)
   by 0x52CCA61: nsThreadManager::NewThread(unsigned int, nsIThread**) 
                 (nsThreadManager.cpp:245)
   by 0x528E4FE: NS_NewThread_P(nsIThread**, nsIRunnable*) 
                 (nsThreadUtils.cpp:74)
   by 0x400C62: TestThreads() (TestThreads.cpp:14760)
   by 0x400CEC: main (TestThreads.cpp:14787)

 This conflicts with a previous read of size 8 by thread #2
   at 0x528831B: nsCOMPtr_base::assign_assuming_AddRef(nsISupports*) 
                 (nsCOMPtr.h:456)
   by 0x52CC5D8: nsCOMPtr<nsIRunnable>::operator=(nsIRunnable*) (nsCOMPtr.h:640)
   by 0x52CBC45: nsThread::ThreadFunc(void*) (nsThread.cpp:250)
   by 0x5B56FC2: _pt_root (ptthread.c:230)
   by 0x4C2A5D8: mythread_wrapper (hg_intercepts.c:213)
   by 0x4E34A03: start_thread (pthread_create.c:300)
   by 0x67F880C: clone (clone.S:112)

 Address 0xc6c31b0 is 0 bytes inside a block of size 32 alloc'd
   at 0x4C25B08: malloc (vg_replace_malloc.c:236)
   by 0x5524E8B: moz_xmalloc (mozalloc.cpp:75)
   by 0x52CC5FA: nsThreadStartupEvent::Create() (mozalloc.h:222)
   by 0x52CBAAF: nsThread::Init() (nsThread.cpp:315)
   by 0x52CCA61: nsThreadManager::NewThread(unsigned int, nsIThread**) 
                 (nsThreadManager.cpp:245)
   by 0x528E4FE: NS_NewThread_P(nsIThread**, nsIRunnable*) 
                 (nsThreadUtils.cpp:74)
   by 0x400C62: TestThreads() (TestThreads.cpp:14760)
   by 0x400CEC: main (TestThreads.cpp:14787)

 The old value at address 0xc6c31b0 was 0x551de70
   and 0x551de70 is symbol "vtable for nsThreadStartupEvent+16"
Comment 7 Julian Seward [:jseward] 2010-04-12 02:39:15 PDT
(In reply to comment #6)
> Possible vtable-overwrite races in xpcom

These were analysed and found to be false positives in bug 558402, now
closed-as-invalid.  Issue is Helgrind does not understand that the
last Release-r of an object becomes its exclusive owner and may
arbitrarily mess with it without any locking during its destruction.
This is trivially fixed by annotating NS_IMPL_THREADSAFE_RELEASE.
Comment 8 Julian Seward [:jseward] 2010-04-12 02:57:28 PDT
Update of current status:

About 2/3 of the races reported for "start browser to blank page; wait
for a while; quit" have been triaged.  Outcome is: two false positives
to do with reference counting, and a number of races that are
harmless.  It's not clear that all of the latter were known about or
intended by the code's authors.

Currently there remain untriaged race reports in the following
methods:

  NS_SetGlobalThreadObserver(nsIThreadObserver*)

  nsTimerImpl::SetDelayInternal(unsigned int)
  TimerThread::UpdateFilter
  TimerThread::Run()

  JSBackgroundThread::busy()
  JSBackgroundThread::work()

  nsSocketTransport::OnLookupComplete

  nsHttpTransaction::~nsHttpTransaction()
  nsHttpTransaction::WritePipeSegment

  nsDiskCacheOutputStream::Close()

  base::MessagePumpLibevent::~MessagePumpLibevent()

There's also a possible lock order problem which looks like it might
have to do with /usr/lib/libORBit-2.so.0.1.0.
Comment 9 Robert Sayre 2010-04-12 06:14:10 PDT
(In reply to comment #8)
> 
>   nsDiskCacheOutputStream::Close()

oooh, please look into that one first.
Comment 10 Julian Seward [:jseward] 2010-04-13 15:25:00 PDT
> (In reply to comment #8)
> >   nsDiskCacheOutputStream::Close()
> oooh, please look into that one first.

The race isn't always reported, only on about 50% of startup-then-idle
runs.

Relevant source file is netwerk/cache/src/nsDiskCacheStreams.cpp.

It appears nsDiskCacheOutputStream::mClosed is accessed by multiple
threads without locking.  I'm seeing nsDiskCacheOutputStream::Close
racing against nsDiskCacheOutputStream::Write and also (I suspect)
against itself:


232 NS_IMETHODIMP
233 nsDiskCacheOutputStream::Close()
234 {
235     if (!mClosed) {
236         mClosed = PR_TRUE;
237         // tell parent streamIO we are closing
238         mStreamIO->CloseOutputStream(this);
239     }
240     return NS_OK;
241 }

253 NS_IMETHODIMP
254 nsDiskCacheOutputStream::Write(const char *buf, PRUint32 count, PRUint32 *bytesWritten)
255 {
256     if (mClosed)  return NS_BASE_STREAM_CLOSED;
257     return mStreamIO->Write(buf, count, bytesWritten);
258 }


* Write at line 236 races against read at line 256
  (::Write vs ::Close)

* Write at line 236 races against read at line 235
  (::Close vs ::Close)

* Write at line 236 races against write at line 236
  (::Close vs ::Close)

So imagine 2 threads enter ::Close in lockstep, both find mClosed ==
PR_FALSE, so mStreamIO->CloseOutputStream(this) is done twice (and
without locking) for the same stream.  Is that ok?

Also (probably more serious) imagine two threads that want
to do ::Write and then ::Close.  Initially mClose == PR_FALSE.
T1 enters ::Write, does its write, then enters ::Close,
does ::mClosed = PR_TRUE, then stalls.  T2 enters ::Write,
sees mClosed == PR_TRUE, returns, hence its data is never
written.
Comment 11 Julian Seward [:jseward] 2010-04-13 15:37:16 PDT
Hmm, possibly is more of the same thread-safe-reference-counting
confuses Helgrind problem.  Needs further analysis.
Comment 12 Julian Seward [:jseward] 2010-04-16 04:03:19 PDT
Making sense of race reports is way more difficult than making sense
of memory error reports.  I'll need help from the module owners.
Am adopting the strategy of filing what I have as separate bugs,
marking them UNCONFIRMED for the time being.  If they turn out to
be bogus or harmless they can be closed-as-invalid-ed.

AFAICS triaging a race report can have 3 outcomes:

(1) it's a real race, which is harmful.  This is the only "useful"
    outcome.

(2) it's a real race, but is not dangerous.  Options are to fix it
    anyway, or add an annotation to the source directing Helgrind to
    ignore it.  At a minimum we need to ensure the source contains a
    comment making it clear that the race is known to exist, and
    (preferably) come up with a rigorous argument why the race is
    harmless.

(3) it's not a race.  This happens when Helgrind fails to see some
    inter-thread synchronisation event.  The "fix" is again to annotate
    the Fx source to describe what is going on.

A good example of (3) are Release methods as described in comment 7.
Comment 13 Ben Turner (not reading bugmail, use the needinfo flag!) 2011-02-24 11:10:09 PST
Hi Julian, are you still working on this or has this moved to the far back burner?
Comment 14 Julian Seward [:jseward] 2011-02-24 14:44:30 PST
Working actively on this -- see also bug 547736 comment 17.  This
week I've fixed enough scaling problems in Helgrind to make it
feasible to run substantial chunks of mochitests in one go.  Now
looking at redoing this patch, including redoing annotations of thread
safe ref counting (nsCOMPtr et al), which I realise now I did 
incorrectly in previous attempts.
Comment 15 Julian Seward [:jseward] 2011-03-01 16:40:06 PST
Created attachment 516080 [details] [diff] [review]
WIP patch against M-C of 24 Feb 2011

WIP patch, suitable for Helgrinding SpiderMonkey.  Also greatly
reduces the noise level from a browser startup/quit.

General idea is for the patch to contain a mix of

- annotations which are essential to describe to Helgrind, sync
  operations which it doesn't otherwise understand (thread safe
  refcounting, mostly)

- annotations to hide detected races.

Idea is for the patch to serve as a record of the detected races.
Races that are considered harmful should get filed as bugs and
eventually fixed, in which case their hiding-annotations can be
removed from future versions of the patch.

All annotations are marked with a HGANN comment line, to make them
easily greppable.  Currently there are 26 annotations, of which
6 describe thread-safe refcounting, 7 describe other sync events,
11 hide known-harmless races, and 2 hide non-harmless races.
The complete list is below.  n-i-bz means not-in-bugzilla.

// HGANN 001 BEGIN 551155 HIDES_HARMLESS sEmptyHdr::mLength += 0
// HGANN 001 END

/* HGANN 002 BEGIN REQD_ANN thread-safe-ref-counting */                     \
/* HGANN 002 END */                                                         \

/* HGANN 003 BEGIN REQD_ANN thread-safe-ref-counting */               \
/* HGANN 003 END */                                                   \

// HGANN 004 BEGIN REQD_ANN thread-safe-ref-counting
// HGANN 004 END

/* HGANN 005 BEGIN REQD_ANN thread-safe-ref-counting */
/* HGANN 005 END */

// HGANN 006 BEGIN n-i-bz HIDES_HARMLESS on JSRuntime::various_fields
// HGANN 006 END

// HGANN 007 BEGIN n-i-bz HIDES_HARMLESS on JSThreadData::requestDepth
// HGANN 007 END

// HGANN 008 BEGIN n-i-bz HIDES_HARMLESS on JSContext::outstandingRequests
// HGANN 008 END

// HGANN 009 BEGIN n-i-bz HIDES_HARMLESS on LocalTZA (global var)
// HGANN 009 END

// HGANN 010 BEGIN REQD_ANN js_InitLock
// HGANN 010 END

// HGANN 011 BEGIN REQD_ANN js_FinishLock
// HGANN 011 END

// HGANN 012 BEGIN REQD_ANN ThinLock
// HGANN 012 END

// HGANN 013 BEGIN REQD_ANN ThinUnlock
// HGANN 013 END

// HGANN 014 BEGIN n-i-bz HIDES_HARMLESS on Assembler::nHints (global)
// HGANN 014 END

// HGANN 015 BEGIN n-i-bz HIDES_HARMLESS on Assembler::nHints (global)
// HGANN 015 END

// HGANN 016 BEGIN 559927 REQD_FIX unclear-if-real-bug-nor-if-correct-fix
// HGANN 016 END

// HGANN 017 BEGIN n-i-bz HIDES_HARMLESS on nsSocketTransport::mResolving
// HGANN 017 END

// HGANN 018 BEGIN REQD_ANN thread-safe-ref-counting
// HGANN 018 END

// HGANN 019 BEGIN 560579 REQD_FIX avoid bitfield race in nsHttpTransaction
// HGANN 019 END

// HGANN 020 BEGIN REQD_ANN PRCondVar-send (CHECKME: REQUIRED ??)
// HGANN 020 END

// HGANN 021 BEGIN REQD_ANN PRCondVar-recv (CHECKME: REQUIRED ??)
// HGANN 021 END

// HGANN 022 BEGIN n-i-bz HIDES_HARMLESS on PRMonitor::owner (RECHECK)
// HGANN 022 END

// HGANN 023 BEGIN n-i-bz HIDES_HARMLESS on PRThread::id (RECHECK)
// HGANN 023 END

// HGANN 024 BEGIN n-i-bz HIDES_HARMLESS on nsRecyclingAllocator::mFreeList
// HGANN 024 END

// HGANN 025 BEGIN REQD_ANN nsRecyclingAllocator::Malloc
// HGANN 025 END

/* HGANN 026 BEGIN REQD_ANN thread-safe-ref-counting */
/* HGANN 026 END */
Comment 16 Julian Seward [:jseward] 2011-03-01 16:44:13 PST
Above results achieveable with Valgrind-3.6.1 or trunk.  Trunk is
much preferable due to significant scalability improvements in
Helgrind.  In either case the as-yet-uncommitted patch below is
needed for correct operation.


Index: helgrind/hg_main.c
===================================================================
--- helgrind/hg_main.c	(revision 11576)
+++ helgrind/hg_main.c	(working copy)
@@ -3207,7 +3207,8 @@
    so = map_usertag_to_SO_lookup_or_alloc( usertag );
    tl_assert(so);
 
-   libhb_so_send( thr->hbthr, so, True/*strong_send*/ );
+   //libhb_so_send( thr->hbthr, so, True/*strong_send*/ );
+   libhb_so_send( thr->hbthr, so, False/*!strong_send*/ );
 }
 
 static
Comment 17 Julian Seward [:jseward] 2011-03-02 05:53:19 PST
Created attachment 516246 [details] [diff] [review]
WIP patch against M-C of 24 Feb 2011 (revised)

Adds description and hiding-annotations of 6 further races shown below.
Together with a small suppressions file to hide complainage to do with
Gnome libraries and to do with glibc's DNS lookups, this allows a clean
(no-reported-races) startup to blank page, then exit, of the browser.

Note this doesn't fix any of the races.  It merely continues the process
of cataloguing them.

// HGANN 027 BEGIN n-i-bz HIDES_UNASSESSED on ::sGlobalObserver (global var)
// HGANN 027 END

// HGANN 028 BEGIN n-i-bz HIDES_UNASSESSED on ns[..]::mInitialized (d-c locking)
// HGANN 028 END

// HGANN 029 BEGIN n-i-bz HIDES_UNASSESSED on Canary::sOutputFD (global var)
// HGANN 029 END

// HGANN 030 BEGIN 637461 HIDES_UNASSESSED on StartupCache::mWriteThread
// HGANN 030 END

// HGANN 031 BEGIN 559793 HIDES_UNASSESSED on TimerThread::mTimeoutAdjustment
// HGANN 031 END

// HGANN 032 BEGIN 559793 HIDES_UNASSESSED on nsTimerImpl::mArmed
// HGANN 032 END
Comment 18 Julian Seward [:jseward] 2011-03-02 05:59:00 PST
Created attachment 516248 [details]
suppressions file referred to in comment 17

For use w/ valgrind trunk on x64-linux (Ubuntu 10.04.2)
Comment 19 Julian Seward [:jseward] 2011-03-23 09:25:07 PDT
Created attachment 521198 [details] [diff] [review]
WIP patch against M-C of 23 Mar 2011
Comment 20 Julian Seward [:jseward] 2011-03-23 09:36:06 PDT
Created attachment 521210 [details]
Suppressions to go with markup in comment #19 (for Ubuntu 10.04 64 bit)
Comment 21 Julian Seward [:jseward] 2011-04-06 12:46:19 PDT
Created attachment 524238 [details] [diff] [review]
WIP patch against M-C of 6 Apr 2011
Comment 22 Christian Holler (:decoder) 2012-05-08 12:17:36 PDT
What's required to get this into the tree? It would be very nice to have this also from a security standpoint.
Comment 23 Julian Seward [:jseward] 2012-05-08 14:34:49 PDT
Created attachment 622146 [details] [diff] [review]
WIP patch against M-C of 23 Apr 2012
Comment 24 ISHIKAWA, Chiaki 2012-11-05 05:25:37 PST
Created attachment 678282 [details]
Re-created patch against later C-C mozilla subdirectory (with USEHELGRIND macro)

While I am investigating strange usage of uninitialized value
(Bug 803816 - Valgrind warnings about uninitialised memory use (Thunderbird)), I tried to run thunderbird under helgrind.

After trying to apply the patch automatically and saw there were many FAILED hunks, I re-created the patch against ./mozilla subdirectory of comm-central.

The version of mozilla I used is (in comm-central/mozilla subdirectory)

    $ hg identify
    a517f7ea5bef+ tip

I enclosed the changes into

#ifdef USEHELGRIND
       ...
#endif

appropriately as much as possible.

In order to enable the if-defed parts, when I ran configure and make I added -DUSHELGRIND=1 to CC and CXX macros manually. (Someone might be able to figure out how to add these to compiler flags automagically).

E.g., like this in a shell script:

export CC="/usr/bin/gcc -DDEBUG_4GB_CHECK -DUSEHELGRIND=1"
export CXX="/usr/bin/g++ -DDEBUG_4GB_CHECK -DUSEHELGRIND=1"
time make -f client.mk \
    CC="/usr/bin/gcc -DDEBUG_4GB_CHECK -DUSEHELGRIND=1" \
    CXX="/usr/bin/g++ -DDEBUG_4GB_CHECK -DUSEHELGRIND=1" \
    $*


For creating the patch, I found a few issues:

However, there is one header file which is not amenable to a small patch using #if/#else/#endif like this. Also, there was an assembler source file which I left as is.

For one file, already a patch has been incorporated into the repository to include the suggested mlock and so I did not bother to do anything for this file.

>patching file netwerk/base/src/nsAsyncStreamCopier.cpp
>Hunk #1 FAILED at 272.
>1 out of 1 hunk FAILED -- saving rejects to file netwerk/base/src/nsAsyncStreamCopier.cpp.rej


The following file may be a problem.  I found that no code remains at all in the header file that looks like the ones that was patched. (???) So this change is not in the re-created patch at all. I have no idea what is going on here.

>patching file netwerk/dns/nsHostResolver.h
>Hunk #1 FAILED at 62.
>1 out of 1 hunk FAILED -- saving rejects to file netwerk/dns/nsHostResolver.h.rej

>Assembly Source for 64 bits:?
>patching file nsprpub/pr/src/md/unix/os_Linux_x86_64.s
>Hunk #1 succeeded at 26 (offset -32 lines).

>Use of IF/ENDIF may not be possible in macro definition.
>patching file xpcom/glue/nsISupportsImpl.h

Hope this may help the inclusion of the code into the main repository. Someone ought to check the operation of thunderbird and firefox occasionally under valgrind (memgrind, helgrind, etc.)
Comment 25 ISHIKAWA, Chiaki 2012-11-05 05:28:17 PST
Created attachment 678284 [details] [diff] [review]
Re-created patch against later C-C mozilla subdirectory (with USEHELGRIND macro)

(I am uploading the patch again as "patch" so that "diff" display feature can be used.)
I am obsoleting the previous upload.
TIA
Comment 26 ISHIKAWA, Chiaki 2012-12-29 06:12:43 PST
I think that the portion of the patch to xpcom/base/nsAtomicRefcnt.h below
got the type wrong: i.e., nsrefcnt needs to be replaced with int32_t in two places. I am running with such modification for now.


-inline int32_t
+#ifndef USEHELGRIND
+inline nsrefcnt
+#else
+__attribute__((noinline)) static nsrefcnt
+#endif
 NS_AtomicDecrementRefcnt(int32_t &refcnt)
 {
   return PR_ATOMIC_DECREMENT(&refcnt);
 }
Comment 27 Julian Seward [:jseward] 2013-01-08 09:00:59 PST
Created attachment 699268 [details] [diff] [review]
A minimalist starting-off patch, for consideration

This is a minimal starting-off patch, for consideration.  It is the
first of a series of patches which makes Gecko race-detector friendly.

The patch describes the synchronisation effects caused by threadsafe
ref counting, as this is not something tools can themselves infer.  In
effect it says that the thread that causes a 1->0 refcount transition
acquires exclusive ownership of the object, so that subsequent
accesses to it (to destruct it) can legitimately be done without
locking.

This is for the classes: base::subtle::RefCountedBase,
nsHttpTransaction and (most importantly) nsISupports.

The patch also tightens up the relevant feature test
(--enable-valgrind) to ensure all the required headers are present.

The annotations are commented in a particular way, that makes it
possible to get a summary of all annotations in the code base by
grepping for the string HGANN.  This is important because there will
eventually be about 30 ish annotations needed, most of which hide
known races.
Comment 28 Benjamin Smedberg [:bsmedberg] 2013-01-10 09:41:45 PST
Comment on attachment 699268 [details] [diff] [review]
A minimalist starting-off patch, for consideration

This is fine by me, but I think cjones is probably the best person to provide feedback, since he's my expert on concurrent programming.
Comment 29 Chris Jones [:cjones] inactive; ni?/f?/r? if you need me 2013-01-16 11:41:56 PST
Comment on attachment 699268 [details] [diff] [review]
A minimalist starting-off patch, for consideration

The approach looks fine.

Do helgrind/VEX have any provisions to run the HB/HA client requests atomically with the atomic memory mutation?
Comment 30 Julian Seward [:jseward] 2013-01-21 04:28:09 PST
(In reply to Chris Jones [:cjones] [:warhammer] from comment #29)
> Do helgrind/VEX have any provisions to run the HB/HA client requests
> atomically with the atomic memory mutation?

No, it doesn't.  In this case (to do with threadsafe refcounting) I don't
think it matters much if we say that the HB event happens slightly before
the atomic op and the HA happens after, so long as it doesn't report a
race on the atomically modified location (mRefCnt) itself, which it
indeed doesn't.
Comment 31 Ben Turner (not reading bugmail, use the needinfo flag!) 2013-03-01 17:04:14 PST
Julian, let's get the minimal patch in. Is cjones' f+ enough here?
Comment 32 Ben Turner (not reading bugmail, use the needinfo flag!) 2013-03-04 19:04:34 PST
NSPR changes moved to bug 847764.
Comment 33 Ben Turner (not reading bugmail, use the needinfo flag!) 2013-03-06 04:58:49 PST
MFBT patch in bug 848303.
Comment 34 Ben Turner (not reading bugmail, use the needinfo flag!) 2013-03-06 05:25:54 PST
XPCOM refcounting changes pushed to bug 848305.
Comment 35 Chris Peterson [:cpeterson] 2014-06-13 23:52:04 PDT
*** Bug 547736 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.