TSan: data race ipc/glue/MessagePump.cpp:142 ScheduleWork (race on mThread)

RESOLVED FIXED in Firefox 48

Status

()

defect
RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: jseward, Assigned: billm)

Tracking

(Blocks 1 bug, {csectype-race, sec-moderate})

Trunk
mozilla48
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox45 affected, firefox48 fixed, firefox-esr45 wontfix)

Details

(Whiteboard: [tsan][post-critsmash-triage][adv-main48+])

Attachments

(2 attachments, 2 obsolete attachments)

Posted file TSan stack trace
The attached logfile shows a thread/data race detected by TSan (ThreadSanitizer).

* Specific information about this bug

STR: apply test harness patch from bug 1222043m then run:

TSAN_OPTIONS=suppressions=/home/sewardj/MOZ/SUPPS/tsan-1.txt DISPLAY=:1.0 ./mach xpcshell-test --sequential --log-mach - dom/base/test/unit_ipc/test_bug553888_wrap.js 2>&1 | tee spew-53-tsan-1

Looks like a race on MessagePump::mThread.

* General information about TSan, data races, etc.

Typically, races reported by TSan are not false positives, but it is possible that the race is benign. Even in this case though, we should try to come up with a fix unless this would cause unacceptable performance issues. Also note that seemingly benign races can possibly be harmful (also depending on the compiler and the architecture) [1][2].

If the bug cannot be fixed, then this bug should be used to either make a compile-time annotation for blacklisting or add an entry to the runtime blacklist.

[1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[2] _How to miscompile programs with "benign" data races_: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
A bit of analysis.  What we have is:

toolkit/xre/nsEmbedFunctions.cpp:556: a MessageLoop (containing MessagePump?) object is
created.  In particular, uiMessageLoop.mThread is set to nullptr at this
point.

555    // Associate this thread with a UI MessageLoop
556    MessageLoop uiMessageLoop(uiLoopType);
557    {

A child thread is created:

574          process = new ContentProcess(parentPID);

The child goes off and reads uiMessageLoop.mThread, uncoordinated:
ipc/glue/MessagePump.cpp:142

141  // Make sure the event loop wakes up.
142  if (mThread) {
143    mThread->Dispatch(mDoWorkEvent, NS_DISPATCH_NORMAL);

Meanwhile, uncoordinated, some time later, the parent writes uiMessageLoop.mThread:

622      // Run the UI event loop on the main thread.
623      uiMessageLoop.MessageLoop::Run();
         (leads to "mThread = NS_GetCurrentThread()" at MessagePump.cpp:84

It's unclear to me whether the author(s) intend a race to exist, and
think it is OK, or whether they are/were unaware of the existence of
the race.
Looking more at MessagePump::ScheduleWork() in ipc/glue/MessagePump.cpp,
it looks like this race could cause events (mDoWorkEvent) to mistakenly be
delivered to the main thread rather than to the thread that mThread will
eventually refer to.  This seems bad.
Thanks Julian. This does seems like a very serious bug. I'm going to close it for now.

I think we should be able to initialize mThread in the MessagePump constructor. I'll post a patch tomorrow or Monday.
Assignee: nobody → wmccloskey
Group: core-security
Flags: needinfo?(wmccloskey)
Actually, I guess this isn't a big deal. mThread is just going to end up being the main thread anyway. We should still fix it, but it's probably not causing anyone any trouble.
Flags: needinfo?(wmccloskey)
That seems to imply that the only possible mThread transition is nullptr --> main_thread.
If so, would it be possible to simply bake the main thread ID into the MessagePump when it
is created?

Anyway, am happy to test a patch as/when available.
Group: core-security → dom-core-security
Posted patch patch (obsolete) — Splinter Review
Does this fix the problem?
Attachment #8725487 - Flags: feedback?(jseward)
(In reply to Bill McCloskey (:billm) from comment #6)
> Does this fix the problem?

Yes, it does.  Thanks!
Posted patch patch v2 (obsolete) — Splinter Review
The tricky issue here is that the current thread sometimes isn't available in the MessagePump constructor because XPCOM hasn't been initialized yet. I'm guessing that's why we initialize it in ::Run instead.

I was a little concerned that there could be a race here where we call ScheduleWork before the thread manager is initialized. However, ContentChild does that explicitly to avoid that situation:
https://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp#666
The Open call on the first channel doesn't happen until after the thread manager is initialized, so we're safe there.

There's also a question whether this could be a race for non-main threads. I made some changes to pass the thread in explicitly for non-main threads. I think there might be a small window here where the message loop is exposed outside the thread and mThread hasn't been set yet:
https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/thread.cc#164
Attachment #8725487 - Attachment is obsolete: true
Attachment #8725487 - Flags: feedback?(jseward)
Attachment #8726357 - Flags: review?(jld)
Comment on attachment 8726357 [details] [diff] [review]
patch v2

Review of attachment 8726357 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for taking so long.  I've been staring at this (and, more importantly, the event loop code) for a while, and I'm not completely convinced I understand everything I should.  But, if I understand correctly enough, there are a few cases:

* The main thread, in which case mThread is (after this patch) always nullptr and NS_DispatchToMainThread is used (I assume that does something reasonable if called early?).

* A non-main thread, but its nsIThread exists when it's started, so mThread is initialized in the constructor and everything is fine.

* A non-main thread started before nsIThreads exist.  (Is this just the child process's IPC I/O thread?  Are there other cases?  Does PBackground always wind up in the previous case?)  Here mThread is initially nullptr and later changed to the actual thread at the top of MessagePump::Run.  Could this be explicitly distinguished from the main-thread case somehow?  It would be good if we could release assert or otherwise fail instead of quietly dispatching a runnable to the wrong thread.

In any case, it looks like an improvement on the current situation.

::: ipc/glue/MessagePump.cpp
@@ +79,5 @@
> +{
> +  if (!NS_IsMainThread()) {
> +    // Don't write to mThread if we don't have to. It can cause a race if
> +    // ScheduleWork runs on another thread.
> +    mThread = NS_GetCurrentThread();

If mThread is already set to a non-null value from the constructor, shouldn't this skip the write?  (And maybe assert or release-assert that it's the correct thread.)

::: ipc/glue/MessagePump.h
@@ +120,5 @@
>    NS_DECL_NSITHREADOBSERVER
>  
>  public:
> +  MessagePumpForNonMainUIThreads(nsIThread* aThread) :
> +    MessagePump(aThread),

Does this work?  It looks like MessagePumpForNonMainUIThreads doesn't inherit from MessagePump.
Attachment #8726357 - Flags: review?(jld) → review+
(In reply to Bill McCloskey (:billm) from comment #8)
> Created attachment 8726357 [details] [diff] [review]
> patch v2

This patch appears to remove the TSan complaints.  +1 from me, therefore.
This looks like the problem: https://hg.mozilla.org/integration/mozilla-inbound/rev/00f8c8fde8ca0e3d2816cf3ddc3f6e26b5070c52#l4.12

The fix should be a simple matter of adding "explicit" to those 1-argument constructors that used to be 0-argument.
Flags: needinfo?(wmccloskey) → needinfo?(jld)
Posted patch patch v3Splinter Review
Attachment #8726357 - Attachment is obsolete: true
Flags: needinfo?(jld)
Attachment #8736540 - Flags: review+
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7afa9ef11df6 (shown here as two patches, but I squashed them for the patch I just uploaded).  I didn't rerun tests (besides building) because the incremental change shouldn't affect anything besides the static-analysis failure.
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/c91015c43d55
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Group: dom-core-security → core-security-release
Whiteboard: [tsan] → [tsan][post-critsmash-triage]
Whiteboard: [tsan][post-critsmash-triage] → [tsan][post-critsmash-triage][adv-main48+]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.