Closed Bug 1157323 Opened 9 years ago Closed 8 years ago

TSan: data races in xpcom/threads/nsTimerImpl.cpp InitCommon / GetGeneration / ReleaseTimerInternal / IsRepeatingPrecisely

Categories

(Core :: XPCOM, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox52 --- fixed

People

(Reporter: froydnj, Assigned: bwc)

References

(Blocks 1 open bug)

Details

(Whiteboard: [tsan])

Attachments

(7 files, 1 obsolete file)

The attached logfile shows a thread/data race detected by TSan (ThreadSanitizer).

* Specific information about this bug

nsTimerImpl has a bunch of variables that aren't atomic (mType, mArmed, mGeneration), but are accessed on both the thread for the timer and the TimerThread.  People have known about this, generally with some sense of "boy, that's awful!" when they look at the timer code, but now TSan is complaining about it.

The locks we take might be inadvertently making these safe, but I doubt it.

* General information about TSan, data races, etc.

Typically, races reported by TSan are not false positives, but it is possible that the race is benign. Even in this case though, we should try to come up with a fix unless this would cause unacceptable performance issues. Also note that seemingly benign races can possibly be harmful (also depending on the compiler and the architecture) [1][2].

If the bug cannot be fixed, then this bug should be used to either make a compile-time annotation for blacklisting or add an entry to the runtime blacklist.

[1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[2] _How to miscompile programs with "benign" data races_: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Attached file timer-race.txt
Probably more helpful to understand the races if the TSan reports are attached.
Anyone have a working TSan build that they can try this out with?
Assignee: nobody → docfaraday
Yeah.  I do.  I can try it out in the morning.
Ok, figured out my tsan goof, TestTimers looks good now with tsan. I can run some more tests.
Fixing this race sounds like a good thing.  I should perhaps point out,
though, I cannot reproduce it in TSan runs of the browser, either for
a bit of ad-hoc surfing, or in a mochitest run.

Nathan, do you have any STR for it?
Flags: needinfo?(nfroyd)
(In reply to Julian Seward [:jseward] from comment #7)
> Fixing this race sounds like a good thing.  I should perhaps point out,
> though, I cannot reproduce it in TSan runs of the browser, either for
> a bit of ad-hoc surfing, or in a mochitest run.
> 
> Nathan, do you have any STR for it?

I have no STR, other than mochitest runs.  I can't recall whether I've tried ad-hoc surfing to trigger the race or not.
Flags: needinfo?(nfroyd)
What is the status of this, now?  Can it land?
I'm still running TSAN tests.
Attachment #8773017 - Flags: review?(nfroyd)
Attachment #8773018 - Flags: review?(nfroyd)
Comment on attachment 8773017 [details]
Bug 1157323 - Part 1: Make TestTimers a unit-test.

https://reviewboard.mozilla.org/r/65656/#review66404
Attachment #8773017 - Flags: review?(nfroyd) → review+
Comment on attachment 8773018 [details]
Bug 1157323 - Part 2: Factor nsTimerImpl into two classes, so we don't need to do racy stuff in nsTimerImpl::Release.

https://reviewboard.mozilla.org/r/65658/#review66406

Please try and separate the straight-up refactoring changes from the actual logic changes.  Reviewing timer changes is hard enough as-is.

Some explanation (preferably in code comments, but commit messages work too) of exactly what's being done and why that de-races things would be splendid.
Attachment #8773018 - Flags: review?(nfroyd)
Review commit: https://reviewboard.mozilla.org/r/69754/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/69754/
Attachment #8773018 - Attachment description: Bug 1157323 - Part 2: De-race nsTimerImpl. → Bug 1157323 - Part 2: Factor nsTimerImpl into two classes, so we don't need to do racy stuff in nsTimerImpl::Release.
Attachment #8773018 - Flags: review?(nfroyd)
Comment on attachment 8773017 [details]
Bug 1157323 - Part 1: Make TestTimers a unit-test.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/65656/diff/1-2/
Comment on attachment 8773018 [details]
Bug 1157323 - Part 2: Factor nsTimerImpl into two classes, so we don't need to do racy stuff in nsTimerImpl::Release.

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/65658/diff/1-2/
Attachment #8778452 - Flags: review?(nfroyd)
Attachment #8778453 - Flags: review?(nfroyd)
Attachment #8778454 - Flags: review?(nfroyd)
Attachment #8778455 - Flags: review?(nfroyd)
Attachment #8778456 - Flags: review?(nfroyd)
Thank you for splitting these apart!  I'll take a look at them on Wednesday.
Comment on attachment 8773018 [details]
Bug 1157323 - Part 2: Factor nsTimerImpl into two classes, so we don't need to do racy stuff in nsTimerImpl::Release.

https://reviewboard.mozilla.org/r/65658/#review68110

What did TSan say about all these patches?  Did they come out clean, as per comment 11?  (Not that TSan cleanness is a guarantee of de-raced code, merely that it didn't observe any races...)

Still trying to swap timer state back in, preliminary comments below.

::: xpcom/threads/nsTimerImpl.h:99
(Diff revision 2)
>      }
>    }
>  
> +  void Destroy()
> +  {
> +    Cancel();

This bit and related things make me uneasy.  What happens in something like the following:

TimerThread: Timer goes off, TimerEvent posted to the target.
TargetThread: TimerEvent starts running, firing the timer.
TargetThread: Firing the timer gets past the `mCanceled` check.
OtherThread: Last reference to the associated nsTimer is released, Destroy()'ing the nsTimerImpl that's in the process of firing.

One of the scenarios that worries me (in addition to possible crashes) is that we can take a strong reference to `mITimer` in `nsTimerImpl::Fire` while step 3 above is in-process, which means that we're holding memory that's about to get freed from underneath us.

I think this sort of thing isn't a concern with the old code because things that the timer events/timer thread holds on to are the same things as the client code.  But with this new code, there's a separation, and it's possible for the two things to get out of sync.

::: xpcom/threads/nsTimerImpl.cpp:115
(Diff revision 2)
> -NS_IMPL_QUERY_INTERFACE(nsTimerImpl, nsITimer)
> -NS_IMPL_ADDREF(nsTimerImpl)
> +NS_IMPL_QUERY_INTERFACE(nsTimer, nsITimer)
> +NS_IMPL_ADDREF(nsTimer)
> +NS_IMPL_RELEASE(nsTimer)

This code can now be replaced by:

```NS_IMPL_ISUPPORTS(nsTimer, nsITimer)
```

now that we don't have to do funny things with Release().
Attachment #8773018 - Flags: review?(nfroyd)
Comment on attachment 8773018 [details]
Bug 1157323 - Part 2: Factor nsTimerImpl into two classes, so we don't need to do racy stuff in nsTimerImpl::Release.

https://reviewboard.mozilla.org/r/65658/#review68760

::: xpcom/threads/nsTimerImpl.cpp:123
(Diff revisions 2 - 3)
> +  if (count == 1) {
> +    // Last ref, held by nsTimerImpl. Break the cycle.
> +    mImpl->Destroy();
> +    return 0;
> +  }

In the previous iteration of this patch, we had the following race:

TimerThread: Timer goes off, TimerEvent posted to the target.
TargetThread: TimerEvent starts running, firing the timer.
TargetThread: Firing the timer gets past the mCanceled check.
OtherThread: Last reference to the associated nsTimer is released, Destroy()'ing the nsTimerImpl that's in the process of firing.

And, for completeness, the callstack on OtherThread at that point looked like:

```
nsTimer::Release()
  nsTimer::~nsTimer()
    nsTimerImpl::Destroy()
```

With this patch, don't we have the same problem?  OtherThread releases the timer, and then we have:

```
nsTimer::Release()
  nsTimerImpl::Destroy()
    nsTimer::Release() // nulling out mITimer
      nsTimer::~nsTimer()
```

and we've yanked out the strong reference that `nsTimerImpl` thought it was holding out from underneath it.  Just because the last reference is being held by nsTimerImpl doesn't mean that it's safe to drop it when it's the only reference left.  There might be code running that still depends on that reference (e.g. the switch on the callbackType).

In short, we still have the same problems here, we've just shuffled the ordering around a little bit.  And even if you did something like "have `nsTimer::Release` set a flag for `nsTimerImpl::Release` to notice that it has to break the cycle", you still have a race (`nsTimerImpl::Release` might be past the point where it could notice the flag setting, I think?).  It's a bit of a thorny problem.
Attachment #8773018 - Flags: review?(nfroyd)
Yeah, this is a mess. Having the last ref go away somewhere other than the target thread is going to be problematic if we want that to result in cancellation of the timer, because it will be racy no matter what we do, unless we add a bunch more locking (Cancel() will race against Fire()). Right now, the last nsTimerImpl ref (that the user of the timer is holding onto) going away results in cancellation only if the timer isn't in a queue waiting to fire, and only if we manage not to race on mArmed or mCancelled.

We could try to ape this behavior by having Destroy simply remove the timer from the TimerThread, which would mean it would fire anyway if it was already in a queue, but this is a pretty weird contract.

Another option is to say that releasing the last ref to the nsITimer anywhere other than the target thread may or may not prevent the timer from popping; we could implement this by removing the timer from the timer thread and incrementing mGeneration in Destroy(), and having nsTimer release its ref to mImpl instead of the other way around.

What do you think?
Flags: needinfo?(nfroyd)
Comment on attachment 8778452 [details]
Bug 1157323 - Part 3: Do not allow mTimeout to change while a timer is in the queue.

https://reviewboard.mozilla.org/r/69754/#review71520
Attachment #8778452 - Flags: review?(nfroyd) → review+
Comment on attachment 8773018 [details]
Bug 1157323 - Part 2: Factor nsTimerImpl into two classes, so we don't need to do racy stuff in nsTimerImpl::Release.

https://reviewboard.mozilla.org/r/65658/#review71522

I think this is safe.  The try results don't look very good, though; can you look at the failures before I review the other patches?
Attachment #8773018 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #42)
> I think this is safe.  The try results don't look very good, though; can you
> look at the failures before I review the other patches?

I mean the try results from comment 33, not the try results attached to mozreview.
Flags: needinfo?(nfroyd)
Comment on attachment 8778453 [details]
Bug 1157323 - Part 4: Stop modifying mTimeout/mDelay from the TimerThread, plus some simplification.

I'm going to cancel the review requests while I wait for try to be debugged.
Attachment #8778453 - Flags: review?(nfroyd)
Attachment #8778454 - Flags: review?(nfroyd)
Attachment #8778455 - Flags: review?(nfroyd)
Attachment #8778456 - Flags: review?(nfroyd)
The try results in comment 33 are no longer relevant, I corrected the bug there some time ago. The current try results are in reviewboard, those look good apart from known intermittents.
Flags: needinfo?(nfroyd)
(In reply to Byron Campen [:bwc] from comment #52)
> The try results in comment 33 are no longer relevant, I corrected the bug
> there some time ago. The current try results are in reviewboard, those look
> good apart from known intermittents.

My mistake, I thought I was looking at the most recent try push, but apparently not.
Flags: needinfo?(nfroyd)
Attachment #8778453 - Flags: review?(nfroyd)
Attachment #8778454 - Flags: review?(nfroyd)
Attachment #8778455 - Flags: review?(nfroyd)
Attachment #8778456 - Flags: review?(nfroyd)
The try results in comment 33 look like they've perma-oranged the clipboard tests?
Comment on attachment 8778454 [details]
Bug 1157323 - Part 5: Remove some unnecessary members.

https://reviewboard.mozilla.org/r/69758/#review79242
Attachment #8778454 - Flags: review?(nfroyd) → review+
Comment on attachment 8778455 [details]
Bug 1157323 - Part 6: A minor optimization to Cancel().

https://reviewboard.mozilla.org/r/69760/#review79244

::: xpcom/threads/nsTimerImpl.cpp:315
(Diff revision 4)
> +  // mCallbackType is never written by the TimerThread. We assume everyone
> +  // else plays nice wrt threadsafety.
> +  if (mCallbackType == CallbackType::Unknown) {

Do you known how often this optimization actually goes off?  My sense is that we're more often (almost always?) canceling timers that have already been armed, so this would be just wasted work...
Attachment #8778455 - Flags: review?(nfroyd) → review+
Comment on attachment 8778456 [details]
Bug 1157323 - Part 6: Some comment improvements.

https://reviewboard.mozilla.org/r/69762/#review79246

Every little bit of documentation helps.

::: xpcom/threads/nsTimerImpl.h:157
(Diff revision 4)
>  
>    // The generation number of this timer, re-generated each time the timer is
>    // initialized so one-shot timers can be canceled and re-initialized by the
>    // arming thread without any bad race conditions.
> +  // This is only modified on the target thread, and only after removing the
> +  // timer from the TimerThread.

And on the arming thread initially, right?
Attachment #8778456 - Flags: review?(nfroyd) → review+
Comment on attachment 8778455 [details]
Bug 1157323 - Part 6: A minor optimization to Cancel().

https://reviewboard.mozilla.org/r/69760/#review79248

::: xpcom/threads/nsTimerImpl.cpp:315
(Diff revision 4)
> +  // mCallbackType is never written by the TimerThread. We assume everyone
> +  // else plays nice wrt threadsafety.
> +  if (mCallbackType == CallbackType::Unknown) {

I'll see if tests pass without this.
Comment on attachment 8778453 [details]
Bug 1157323 - Part 4: Stop modifying mTimeout/mDelay from the TimerThread, plus some simplification.

https://reviewboard.mozilla.org/r/69756/#review79250

The changes around callback refcounting and timer firing make me a little nervous, but I think they're totally equivalent to what we had before.

Please do look into the perma-leaks in comment 54.
Attachment #8778453 - Flags: review?(nfroyd) → review+
That try push is way out of date, the most recent ones are on reviewboard.
(In reply to Byron Campen [:bwc] from comment #61)
> That try push is way out of date, the most recent ones are on reviewboard.

Sorry, you told me that already and I failed to look at the most recent ones...again.
Attachment #8778455 - Attachment is obsolete: true
Pushed by bcampen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a718fc141d0a
Part 1: Make TestTimers a unit-test. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/f814d7c18854
Part 2: Factor nsTimerImpl into two classes, so we don't need to do racy stuff in nsTimerImpl::Release. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/85739737d40d
Part 3: Do not allow mTimeout to change while a timer is in the queue. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/344fdfed174a
Part 4: Stop modifying mTimeout/mDelay from the TimerThread, plus some simplification. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/f33fe0269dea
Part 5: Remove some unnecessary members. r=froydnj
https://hg.mozilla.org/integration/autoland/rev/40bb8f061490
Part 6: Some comment improvements. r=froydnj
See Also: → 1307350
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: