Closed Bug 1121216 Opened 9 years ago Closed 9 years ago

TSan: data races in xpcom/threads/ThreadStackHelper.cpp (many functions)

Categories

(Core :: XPCOM, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla43
Tracking Status
firefox43 --- fixed

People

(Reporter: froydnj, Assigned: froydnj)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

TSan reports interesting data races in ThreadStackHelper, for instance:

WARNING: ThreadSanitizer: data race (pid=6877)
  Write of size 8 at 0x7d8000157000 by main thread (mutexes: write M3836):
    #0 memcpy ??:0 (exe+0x000000029be3)
    #1 FillThreadContext /home/froydnj/src/gecko-dev.git/xpcom/threads/ThreadStackHelper.cpp:842 (libxul.so+0x0000009ada32)
    #2 FillStackHandler /home/froydnj/src/gecko-dev.git/xpcom/threads/ThreadStackHelper.cpp:492 (libxul.so+0x0000009ad173)
    #3 __tsan::ProcessPendingSignals(__tsan::ThreadState*) ??:0 (exe+0x000000027455)
    #4 init /home/froydnj/src/gecko-dev.git/js/src/jscntxt.h:973 (libxul.so+0x0000049dbbdb)
    ...many frames elided...
  Previous write of size 8 at 0x7d8000157000 by thread T26 (mutexes: write M7292):
    #0 malloc ??:0 (exe+0x000000028846)
    #1 moz_xmalloc /home/froydnj/src/gecko-dev.git/memory/mozalloc/mozalloc.cpp:52 (libmozalloc.so+0x000000001a28)
    #2 operator new[] /opt/build/froydnj/build-tsan/xpcom/threads/../../dist/include/mozilla/mozalloc.h:221 (libxul.so+0x0000009aa0a0)
    #3 ReportPermaHang /home/froydnj/src/gecko-dev.git/xpcom/threads/BackgroundHangMonitor.cpp:395 (libxul.so+0x0000009a927a)
    #4 MonitorThread /home/froydnj/src/gecko-dev.git/xpcom/threads/BackgroundHangMonitor.cpp:52 (libxul.so+0x0000009b9eb1)
    #5 _pt_root /home/froydnj/src/gecko-dev.git/nsprpub/pr/src/pthreads/ptthread.c:212 (libnspr4.so+0x00000003e270)

or perhaps:

WARNING: ThreadSanitizer: data race (pid=6835)
  Write of size 8 at 0x7d6800018678 by main thread:
    #0 FillStackBuffer /home/froydnj/src/gecko-dev.git/xpcom/threads/ThreadStackHelper.cpp:691 (libxul.so+0x0000009ad72b)
    #1 FillStackHandler /home/froydnj/src/gecko-dev.git/xpcom/threads/ThreadStackHelper.cpp:491 (libxul.so+0x0000009ad168)
    #2 __tsan::ProcessPendingSignals(__tsan::ThreadState*) ??:0 (exe+0x000000026d45)
    #3 BackgroundHangMonitor /home/froydnj/src/gecko-dev.git/xpcom/threads/BackgroundHangMonitor.cpp:530 (libxul.so+0x0000009ab43f)
    ...many frames elided....
  Previous read of size 8 at 0x7d6800018678 by thread T25 (mutexes: write M6523):
    #0 PrepareStackBuffer /home/froydnj/src/gecko-dev.git/xpcom/threads/ThreadStackHelper.cpp:517 (libxul.so+0x0000009a959f)
    #1 RunMonitorThread /home/froydnj/src/gecko-dev.git/xpcom/threads/BackgroundHangMonitor.cpp:275 (libxul.so+0x0000009a92d0)
    #2 MonitorThread /home/froydnj/src/gecko-dev.git/xpcom/threads/BackgroundHangMonitor.cpp:52 (libxul.so+0x0000009b9eb1)
    #3 _pt_root /home/froydnj/src/gecko-dev.git/nsprpub/pr/src/pthreads/ptthread.c:212 (libnspr4.so+0x00000003e270)

I think this code is non-racy and TSan just doesn't have enough information to figure out what's going on.

The problem is thus: we'd like to take a snapshot of some thread's stack.  So we allocate some memory for the stack, and then ask ThreadStackHelper to fill it in:

http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/ThreadStackHelper.cpp#233

we do this in a kind of unusual fashion: 

1) We enqueue a real-time signal to the current thread and send it off;
2) We ::sem_wait until the signal handler has executed;
3) The signal handler does all the grotty bits, and ::sem_post's when it's done:

http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/ThreadStackHelper.cpp#486

We do this, as best as I can tell, so that we guarantee that the calling thread has suspended itself, and we can walk its stack in relevant peace.

TSan complains because we are preparing the stack to record on one thread (the BackgroundHangMonitor thread in the above cases), and scribbling over it on another.  But AFAICS, there's no racy accesses here; everything is synchronized properly by the sem_{wait,post} calls.  TSan doesn't realize that these calls are essentially functioning as a mutex to guard access to the stack.

I *think* we could convince TSan of the safety of what's happening here by rewriting everything to use pthread condition variables; it looks like there's extra processing that takes place with pthread mutexes and condition variables that doesn't happen with basic semaphores.

Except that we'd need to pthread_cond_signal from the signal handler, and the man page is very explicit that pthread_cond_signal is *not* async-signal safe.  Which is presumably why semaphores were used for this case in the first place.

I dislike doing it, but I think the best way forward here is to write some detailed comments in ThreadStackHelper.cpp and liberally apply MOZ_TSAN_BLACKLIST.  WDYT?
I don't see any benefit to changing this code. We've been shipping it for several years now and there's no signs of outstanding problems. Getting this code right on all platforms was fairly tedious (you only discuss Linux here) and changing it just to make sure TSAN doesn't complain is a poor reason. Of course if we have any doubt then things are different.

I'm happy whitelisting / ignoring the errors that apply to this code.
TSan complains about signal handlers that clobber errno.  This complaint seems
reasonable enough, and I don't think we can work around it by blacklisting the
function (as doing that only prevents memory accesses in the function from
being instrumented, not anything dealing with runtime behavior).  So we'll do
the simple thing here.
Attachment #8642561 - Flags: review?(bgirard)
Here's another approach: disable BHM entirely for TSan builds.  This isn't a
great solution, as we're disabling functionality that we actually ship in
release builds.  However, even with the patches in this bug that blacklist
methods from ThreadStackHelper, one still sees races between various Vector<>
methods called on behalf of (blacklisted) ThreadStackHelper functions.  And
those ThreadStackHelper methods are called on behalf of the BHM.  I don't think
there's a better way to get rid of them, given that TSan's runtime blacklisting
mechanism is...not very powerful.

Julian. especially interested to hear what you have to say about this.
Attachment #8642601 - Flags: review?(nchen)
Attachment #8642601 - Flags: review?(jseward)
Comment on attachment 8642601 [details] [diff] [review]
disable BackgroundHangMonitor for TSan builds

Review of attachment 8642601 [details] [diff] [review]:
-----------------------------------------------------------------

I think this is a reasonable compromise.
Attachment #8642601 - Flags: review?(nchen) → review+
Attachment #8642562 - Flags: review?(bgirard) → review+
Comment on attachment 8642561 [details] [diff] [review]
part 1 - save and restore errno in ThreadStackHelper's signal handler

Review of attachment 8642561 [details] [diff] [review]:
-----------------------------------------------------------------

r+ as-is but wouldn't it be better to move this around the call that modifiers errno? It's not clear form this snippet that is modifying errno.
Attachment #8642561 - Flags: review?(bgirard) → review+
https://hg.mozilla.org/mozilla-central/rev/6f1eb6ef900a
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla43
Assignee: nobody → nfroyd
Comment on attachment 8642601 [details] [diff] [review]
disable BackgroundHangMonitor for TSan builds

Since this has already landed, it's not worth having Julian review it.
Attachment #8642601 - Flags: review?(jseward)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: