Closed Bug 1222782 Opened 4 years ago Closed 4 years ago

TSan: data race netwerk/cache2/CacheIOThread.cpp:148 Target (race on mXPCOMThread)

Categories

(Core :: Networking: Cache, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla45
Tracking Status
firefox45 --- fixed

People

(Reporter: jseward, Assigned: michal)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 2 obsolete files)

Attached file TSan stack trace
STR: apply test harness patch in bug 1222043, then run (with a TSan'd build of course)
./mach xpcshell-test --sequential --log-mach - browser/components/dirprovider/tests/unit/test_bookmark_pref.js

The attached logfile shows a thread/data race detected by TSan (ThreadSanitizer).

* Specific information about this bug

This is a race on CacheIOThread::mXPCOMThread.  CacheIOThread::ThreadFunc() writes to it ..

    MonitorAutoLock lock(mMonitor);
    ...
    mXPCOMThread.swap(xpcomThread);

whilst holding the mMonitor lock for this object.  But CacheIOThread::Target() reads it
without holding that lock .. it holds some other (I assume unrelated) lock instead:

  target = mXPCOMThread;
  if (!target && mThread)
  {
    MonitorAutoLock lock(mMonitor);
    if (!mXPCOMThread)
      lock.Wait();

    target = mXPCOMThread;
  }

If we move the acquisition of mMonitor out of the conditional and put it before the
first line of this fragment, the read of mXPCOMThread is then suitably protected, and TSan stops complaining.

* General information about TSan, data races, etc.

Typically, races reported by TSan are not false positives, but it is possible that the race is benign. Even in this case though, we should try to come up with a fix unless this would cause unacceptable performance issues. Also note that seemingly benign races can possibly be harmful (also depending on the compiler and the architecture) [1][2].

If the bug cannot be fixed, then this bug should be used to either make a compile-time annotation for blacklisting or add an entry to the runtime blacklist.

[1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[2] _How to miscompile programs with "benign" data races_: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Attached patch bug1222782-1.diff (obsolete) — Splinter Review
Proposed fix.  Moves the acquisition of mMonitor a little earlier in Target().
This race is reported 1527 times out of a total of 18693 races
reported during a complete xpcshell test run.  That makes it the
3rd most common race reported.
I think this is a duplicate of bug 1190969?  But this bug actually has a plausible fix, so maybe we should dup the other bug to this one.
Flags: needinfo?(jseward)
(In reply to Nathan Froyd [:froydnj] from comment #3)
> I think this is a duplicate of bug 1190969?

Plausible, but hard to say for sure, because bug 1190969 doesn't have a
stack trace to look at.

Maybe we should just fix this and then see if bug 1190969 is still alive
after that.
Flags: needinfo?(jseward)
Attachment #8684638 - Flags: review?(michal.novotny)
Attached patch patch (obsolete) — Splinter Review
(In reply to Julian Seward [:jseward] from comment #1)
> Proposed fix.  Moves the acquisition of mMonitor a little earlier in
> Target().

I'm pretty sure Honza wouldn't like this fix. I guess making mXPCOMThread atomic should fix the data race too while being a bit more efficient.
Attachment #8686347 - Flags: review?(jseward)
(In reply to Michal Novotny (:michal) from comment #5)
> I'm pretty sure Honza wouldn't like this fix. I guess making mXPCOMThread
> atomic should fix the data race too while being a bit more efficient.

For a complete xpcshell test run, the proposed patch adds 36000 monitor entries
and exits, which presumably means 72000 global bus operations.  Spread over many
minutes (~ 30) of CPU time I would think it very unlikely that it would be possible
to measure any performance loss.

> I guess making mXPCOMThread atomic should fix the data race

I am not convinced of that.  IIUC, atomics only de-race certain operations such as
add-a-number-to-a-memory location.  In this case the read side of the race looks
like this

  target = mXPCOMThread;
  if (!target && mThread)
  {
    MonitorAutoLock lock(mMonitor);
    if (!mXPCOMThread)
      lock.Wait();

and making mXPCOMThread atomic won't do anything to stop the two uses of it here
getting inconsistent values due to racing against a writer.
(In reply to Julian Seward [:jseward] from comment #6)
> For a complete xpcshell test run, the proposed patch adds 36000 monitor
> entries
> and exits, which presumably means 72000 global bus operations.  Spread over
> many
> minutes (~ 30) of CPU time I would think it very unlikely that it would be
> possible
> to measure any performance loss.

Maybe, but please note that cache isn't used too much during xpcshell tests when compared to normal firefox run.

> > I guess making mXPCOMThread atomic should fix the data race
> 
> I am not convinced of that.  IIUC, atomics only de-race certain operations
> such as
> add-a-number-to-a-memory location.  In this case the read side of the race
> looks
> like this
> 
>   target = mXPCOMThread;
>   if (!target && mThread)
>   {
>     MonitorAutoLock lock(mMonitor);
>     if (!mXPCOMThread)
>       lock.Wait();
> 
> and making mXPCOMThread atomic won't do anything to stop the two uses of it
> here
> getting inconsistent values due to racing against a writer.

It might be that I don't understand enough how atomic works. I thought that making mXPCOMThread atomic ensures that CacheIOThread::Target() will see the assigned pointer immediately once it is set on the different thread in CacheIOThread::ThreadFunc().
(In reply to Michal Novotny (:michal) from comment #7)
> Maybe, but please note that cache isn't used too much during xpcshell tests
> when compared to normal firefox run.

Some more numbers, with the proposed patch applied.  The test is: start the
browser with an empty cache, load news.bbc.co.uk, wait till it loads, and then
re-load 3 times.

This requires 47.54 million global bus events, of which about half are from
malloc and free.  The cost incurred by mozilla::net::CacheIOThread::Target()
and the functions it calls, is 21695 global bus events.  That is, about 0.046%
of the total.
(In reply to Michal Novotny (:michal) from comment #5)
> Created attachment 8686347 [details] [diff] [review]
> patch
> I'm pretty sure Honza wouldn't like this fix. I guess making mXPCOMThread
> atomic should fix the data race too while being a bit more efficient.

Ah, sorry.  I didn't see this new patch at first.  It does remove the race.
The profile data is difficult to interpret but I suspect it is more efficient
than the patch I proposed in comment #2.
Comment on attachment 8686347 [details] [diff] [review]
patch

The patch does remove the race, so I am r+'ing it on that basis.  I can't
say anything about the wider implications of the patch, except one thing:

+    ((nsIThread *)mXPCOMThread)->Release();

+          rv = ((nsIThread *)mXPCOMThread)->ProcessNextEvent(false,

These casts are ugly and IMO dangerous; what happens if one day mXPCOMThread
stops being an nsIThread* ?  Please remove them if possible.  Are they actually
necessary, given that mXPCOMThread is declared as follows?

+  Atomic<nsIThread *> mXPCOMThread;
Attachment #8686347 - Flags: review?(jseward) → review+
I get following error when I don't use the cast:

/opt/moz/hg-inbound/netwerk/cache2/CacheIOThread.cpp: In member function ‘void mozilla::net::CacheIOThread::ThreadFunc()’:
/opt/moz/hg-inbound/netwerk/cache2/CacheIOThread.cpp:211:28: error: base operand of ‘->’ has non-pointer type ‘mozilla::Atomic<nsIThread*>’
           rv = mXPCOMThread->ProcessNextEvent(false, &processedEvent);

Instead of casting I can do:

  nsIThread *thread = mXPCOMThread;
  rv = thread->ProcessNextEvent(false, &processedEvent);

It is safer because the assignment would fail to compile if declaration of mXPCOMThread changes. Is this OK?
Flags: needinfo?(jseward)
(In reply to Michal Novotny (:michal) from comment #11)
> Is this OK?

Yes that's fine.  Thanks for fixing it.
Flags: needinfo?(jseward)
Attached patch patch v2Splinter Review
Assignee: nobody → michal.novotny
Attachment #8684638 - Attachment is obsolete: true
Attachment #8686347 - Attachment is obsolete: true
Attachment #8684638 - Flags: review?(michal.novotny)
Attachment #8687296 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/058f15f9cf5e
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
Thanks for the patch here!  I was thinking of something similar for CacheFile::mHandle race, just have an Atomic<T*> and nsRefPtr<T> side by side.  The atomic to provide access and refptr to automate the release.
Duplicate of this bug: 1190969
You need to log in before you can comment on or make changes to this bug.