Closed
Bug 1157323
Opened 9 years ago
Closed 8 years ago
TSan: data races in xpcom/threads/nsTimerImpl.cpp InitCommon / GetGeneration / ReleaseTimerInternal / IsRepeatingPrecisely
Categories
(Core :: XPCOM, defect)
Tracking
()
RESOLVED
FIXED
mozilla52
Tracking | Status | |
---|---|---|
firefox52 | --- | fixed |
People
(Reporter: froydnj, Assigned: bwc)
References
(Blocks 1 open bug)
Details
(Whiteboard: [tsan])
Attachments
(7 files, 1 obsolete file)
87.58 KB,
text/plain
|
Details | |
58 bytes,
text/x-review-board-request
|
froydnj
:
review+
|
Details |
58 bytes,
text/x-review-board-request
|
froydnj
:
review+
|
Details |
58 bytes,
text/x-review-board-request
|
froydnj
:
review+
|
Details |
Bug 1157323 - Part 4: Stop modifying mTimeout/mDelay from the TimerThread, plus some simplification.
58 bytes,
text/x-review-board-request
|
froydnj
:
review+
|
Details |
58 bytes,
text/x-review-board-request
|
froydnj
:
review+
|
Details |
58 bytes,
text/x-review-board-request
|
froydnj
:
review+
|
Details |
The attached logfile shows a thread/data race detected by TSan (ThreadSanitizer). * Specific information about this bug nsTimerImpl has a bunch of variables that aren't atomic (mType, mArmed, mGeneration), but are accessed on both the thread for the timer and the TimerThread. People have known about this, generally with some sense of "boy, that's awful!" when they look at the timer code, but now TSan is complaining about it. The locks we take might be inadvertently making these safe, but I doubt it. * General information about TSan, data races, etc. Typically, races reported by TSan are not false positives, but it is possible that the race is benign. Even in this case though, we should try to come up with a fix unless this would cause unacceptable performance issues. Also note that seemingly benign races can possibly be harmful (also depending on the compiler and the architecture) [1][2]. If the bug cannot be fixed, then this bug should be used to either make a compile-time annotation for blacklisting or add an entry to the runtime blacklist. [1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong [2] _How to miscompile programs with "benign" data races_: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Reporter | ||
Comment 1•9 years ago
|
||
Probably more helpful to understand the races if the TSan reports are attached.
Assignee | ||
Comment 2•8 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/65656/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/65656/
Assignee | ||
Comment 3•8 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/65658/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/65658/
Assignee | ||
Comment 4•8 years ago
|
||
Anyone have a working TSan build that they can try this out with?
Assignee: nobody → docfaraday
Comment 5•8 years ago
|
||
Yeah. I do. I can try it out in the morning.
Assignee | ||
Comment 6•8 years ago
|
||
Ok, figured out my tsan goof, TestTimers looks good now with tsan. I can run some more tests.
Comment 7•8 years ago
|
||
Fixing this race sounds like a good thing. I should perhaps point out, though, I cannot reproduce it in TSan runs of the browser, either for a bit of ad-hoc surfing, or in a mochitest run. Nathan, do you have any STR for it?
Flags: needinfo?(nfroyd)
Reporter | ||
Comment 8•8 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #7) > Fixing this race sounds like a good thing. I should perhaps point out, > though, I cannot reproduce it in TSan runs of the browser, either for > a bit of ad-hoc surfing, or in a mochitest run. > > Nathan, do you have any STR for it? I have no STR, other than mochitest runs. I can't recall whether I've tried ad-hoc surfing to trigger the race or not.
Flags: needinfo?(nfroyd)
Assignee | ||
Comment 9•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=49e042bb34e2
Comment 10•8 years ago
|
||
What is the status of this, now? Can it land?
Assignee | ||
Comment 11•8 years ago
|
||
I'm still running TSAN tests.
Assignee | ||
Updated•8 years ago
|
Attachment #8773017 -
Flags: review?(nfroyd)
Attachment #8773018 -
Flags: review?(nfroyd)
Reporter | ||
Comment 12•8 years ago
|
||
Comment on attachment 8773017 [details] Bug 1157323 - Part 1: Make TestTimers a unit-test. https://reviewboard.mozilla.org/r/65656/#review66404
Attachment #8773017 -
Flags: review?(nfroyd) → review+
Reporter | ||
Comment 13•8 years ago
|
||
Comment on attachment 8773018 [details] Bug 1157323 - Part 2: Factor nsTimerImpl into two classes, so we don't need to do racy stuff in nsTimerImpl::Release. https://reviewboard.mozilla.org/r/65658/#review66406 Please try and separate the straight-up refactoring changes from the actual logic changes. Reviewing timer changes is hard enough as-is. Some explanation (preferably in code comments, but commit messages work too) of exactly what's being done and why that de-races things would be splendid.
Attachment #8773018 -
Flags: review?(nfroyd)
Assignee | ||
Comment 14•8 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/69754/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/69754/
Attachment #8773018 -
Attachment description: Bug 1157323 - Part 2: De-race nsTimerImpl. → Bug 1157323 - Part 2: Factor nsTimerImpl into two classes, so we don't need to do racy stuff in nsTimerImpl::Release.
Attachment #8773018 -
Flags: review?(nfroyd)
Assignee | ||
Comment 15•8 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/69756/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/69756/
Assignee | ||
Comment 16•8 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/69758/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/69758/
Assignee | ||
Comment 17•8 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/69760/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/69760/
Assignee | ||
Comment 18•8 years ago
|
||
Review commit: https://reviewboard.mozilla.org/r/69762/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/69762/
Assignee | ||
Comment 19•8 years ago
|
||
Comment on attachment 8773017 [details] Bug 1157323 - Part 1: Make TestTimers a unit-test. Review request updated; see interdiff: https://reviewboard.mozilla.org/r/65656/diff/1-2/
Assignee | ||
Comment 20•8 years ago
|
||
Comment on attachment 8773018 [details] Bug 1157323 - Part 2: Factor nsTimerImpl into two classes, so we don't need to do racy stuff in nsTimerImpl::Release. Review request updated; see interdiff: https://reviewboard.mozilla.org/r/65658/diff/1-2/
Assignee | ||
Updated•8 years ago
|
Attachment #8778452 -
Flags: review?(nfroyd)
Attachment #8778453 -
Flags: review?(nfroyd)
Attachment #8778454 -
Flags: review?(nfroyd)
Attachment #8778455 -
Flags: review?(nfroyd)
Attachment #8778456 -
Flags: review?(nfroyd)
Reporter | ||
Comment 21•8 years ago
|
||
Thank you for splitting these apart! I'll take a look at them on Wednesday.
Reporter | ||
Comment 22•8 years ago
|
||
mozreview-review |
Comment on attachment 8773018 [details] Bug 1157323 - Part 2: Factor nsTimerImpl into two classes, so we don't need to do racy stuff in nsTimerImpl::Release. https://reviewboard.mozilla.org/r/65658/#review68110 What did TSan say about all these patches? Did they come out clean, as per comment 11? (Not that TSan cleanness is a guarantee of de-raced code, merely that it didn't observe any races...) Still trying to swap timer state back in, preliminary comments below. ::: xpcom/threads/nsTimerImpl.h:99 (Diff revision 2) > } > } > > + void Destroy() > + { > + Cancel(); This bit and related things make me uneasy. What happens in something like the following: TimerThread: Timer goes off, TimerEvent posted to the target. TargetThread: TimerEvent starts running, firing the timer. TargetThread: Firing the timer gets past the `mCanceled` check. OtherThread: Last reference to the associated nsTimer is released, Destroy()'ing the nsTimerImpl that's in the process of firing. One of the scenarios that worries me (in addition to possible crashes) is that we can take a strong reference to `mITimer` in `nsTimerImpl::Fire` while step 3 above is in-process, which means that we're holding memory that's about to get freed from underneath us. I think this sort of thing isn't a concern with the old code because things that the timer events/timer thread holds on to are the same things as the client code. But with this new code, there's a separation, and it's possible for the two things to get out of sync. ::: xpcom/threads/nsTimerImpl.cpp:115 (Diff revision 2) > -NS_IMPL_QUERY_INTERFACE(nsTimerImpl, nsITimer) > -NS_IMPL_ADDREF(nsTimerImpl) > +NS_IMPL_QUERY_INTERFACE(nsTimer, nsITimer) > +NS_IMPL_ADDREF(nsTimer) > +NS_IMPL_RELEASE(nsTimer) This code can now be replaced by: ```NS_IMPL_ISUPPORTS(nsTimer, nsITimer) ``` now that we don't have to do funny things with Release().
Attachment #8773018 -
Flags: review?(nfroyd)
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Reporter | ||
Comment 30•8 years ago
|
||
mozreview-review |
Comment on attachment 8773018 [details] Bug 1157323 - Part 2: Factor nsTimerImpl into two classes, so we don't need to do racy stuff in nsTimerImpl::Release. https://reviewboard.mozilla.org/r/65658/#review68760 ::: xpcom/threads/nsTimerImpl.cpp:123 (Diff revisions 2 - 3) > + if (count == 1) { > + // Last ref, held by nsTimerImpl. Break the cycle. > + mImpl->Destroy(); > + return 0; > + } In the previous iteration of this patch, we had the following race: TimerThread: Timer goes off, TimerEvent posted to the target. TargetThread: TimerEvent starts running, firing the timer. TargetThread: Firing the timer gets past the mCanceled check. OtherThread: Last reference to the associated nsTimer is released, Destroy()'ing the nsTimerImpl that's in the process of firing. And, for completeness, the callstack on OtherThread at that point looked like: ``` nsTimer::Release() nsTimer::~nsTimer() nsTimerImpl::Destroy() ``` With this patch, don't we have the same problem? OtherThread releases the timer, and then we have: ``` nsTimer::Release() nsTimerImpl::Destroy() nsTimer::Release() // nulling out mITimer nsTimer::~nsTimer() ``` and we've yanked out the strong reference that `nsTimerImpl` thought it was holding out from underneath it. Just because the last reference is being held by nsTimerImpl doesn't mean that it's safe to drop it when it's the only reference left. There might be code running that still depends on that reference (e.g. the switch on the callbackType). In short, we still have the same problems here, we've just shuffled the ordering around a little bit. And even if you did something like "have `nsTimer::Release` set a flag for `nsTimerImpl::Release` to notice that it has to break the cycle", you still have a race (`nsTimerImpl::Release` might be past the point where it could notice the flag setting, I think?). It's a bit of a thorny problem.
Attachment #8773018 -
Flags: review?(nfroyd)
Assignee | ||
Comment 31•8 years ago
|
||
Yeah, this is a mess. Having the last ref go away somewhere other than the target thread is going to be problematic if we want that to result in cancellation of the timer, because it will be racy no matter what we do, unless we add a bunch more locking (Cancel() will race against Fire()). Right now, the last nsTimerImpl ref (that the user of the timer is holding onto) going away results in cancellation only if the timer isn't in a queue waiting to fire, and only if we manage not to race on mArmed or mCancelled. We could try to ape this behavior by having Destroy simply remove the timer from the TimerThread, which would mean it would fire anyway if it was already in a queue, but this is a pretty weird contract. Another option is to say that releasing the last ref to the nsITimer anywhere other than the target thread may or may not prevent the timer from popping; we could implement this by removing the timer from the timer thread and incrementing mGeneration in Destroy(), and having nsTimer release its ref to mImpl instead of the other way around. What do you think?
Flags: needinfo?(nfroyd)
Assignee | ||
Comment 32•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7bcb55c1829
Assignee | ||
Comment 33•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=309277f56ae6
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Reporter | ||
Comment 41•8 years ago
|
||
mozreview-review |
Comment on attachment 8778452 [details] Bug 1157323 - Part 3: Do not allow mTimeout to change while a timer is in the queue. https://reviewboard.mozilla.org/r/69754/#review71520
Attachment #8778452 -
Flags: review?(nfroyd) → review+
Reporter | ||
Comment 42•8 years ago
|
||
mozreview-review |
Comment on attachment 8773018 [details] Bug 1157323 - Part 2: Factor nsTimerImpl into two classes, so we don't need to do racy stuff in nsTimerImpl::Release. https://reviewboard.mozilla.org/r/65658/#review71522 I think this is safe. The try results don't look very good, though; can you look at the failures before I review the other patches?
Attachment #8773018 -
Flags: review?(nfroyd) → review+
Reporter | ||
Comment 43•8 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #42) > I think this is safe. The try results don't look very good, though; can you > look at the failures before I review the other patches? I mean the try results from comment 33, not the try results attached to mozreview.
Flags: needinfo?(nfroyd)
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Reporter | ||
Comment 51•8 years ago
|
||
Comment on attachment 8778453 [details] Bug 1157323 - Part 4: Stop modifying mTimeout/mDelay from the TimerThread, plus some simplification. I'm going to cancel the review requests while I wait for try to be debugged.
Attachment #8778453 -
Flags: review?(nfroyd)
Reporter | ||
Updated•8 years ago
|
Attachment #8778454 -
Flags: review?(nfroyd)
Reporter | ||
Updated•8 years ago
|
Attachment #8778455 -
Flags: review?(nfroyd)
Reporter | ||
Updated•8 years ago
|
Attachment #8778456 -
Flags: review?(nfroyd)
Assignee | ||
Comment 52•8 years ago
|
||
The try results in comment 33 are no longer relevant, I corrected the bug there some time ago. The current try results are in reviewboard, those look good apart from known intermittents.
Flags: needinfo?(nfroyd)
Reporter | ||
Comment 53•8 years ago
|
||
(In reply to Byron Campen [:bwc] from comment #52) > The try results in comment 33 are no longer relevant, I corrected the bug > there some time ago. The current try results are in reviewboard, those look > good apart from known intermittents. My mistake, I thought I was looking at the most recent try push, but apparently not.
Flags: needinfo?(nfroyd)
Assignee | ||
Updated•8 years ago
|
Attachment #8778453 -
Flags: review?(nfroyd)
Assignee | ||
Updated•8 years ago
|
Attachment #8778454 -
Flags: review?(nfroyd)
Assignee | ||
Updated•8 years ago
|
Attachment #8778455 -
Flags: review?(nfroyd)
Assignee | ||
Updated•8 years ago
|
Attachment #8778456 -
Flags: review?(nfroyd)
Reporter | ||
Comment 54•8 years ago
|
||
The try results in comment 33 look like they've perma-oranged the clipboard tests?
Reporter | ||
Comment 55•8 years ago
|
||
mozreview-review |
Comment on attachment 8778454 [details] Bug 1157323 - Part 5: Remove some unnecessary members. https://reviewboard.mozilla.org/r/69758/#review79242
Attachment #8778454 -
Flags: review?(nfroyd) → review+
Reporter | ||
Comment 56•8 years ago
|
||
mozreview-review |
Comment on attachment 8778455 [details] Bug 1157323 - Part 6: A minor optimization to Cancel(). https://reviewboard.mozilla.org/r/69760/#review79244 ::: xpcom/threads/nsTimerImpl.cpp:315 (Diff revision 4) > + // mCallbackType is never written by the TimerThread. We assume everyone > + // else plays nice wrt threadsafety. > + if (mCallbackType == CallbackType::Unknown) { Do you known how often this optimization actually goes off? My sense is that we're more often (almost always?) canceling timers that have already been armed, so this would be just wasted work...
Attachment #8778455 -
Flags: review?(nfroyd) → review+
Reporter | ||
Comment 57•8 years ago
|
||
mozreview-review |
Comment on attachment 8778456 [details] Bug 1157323 - Part 6: Some comment improvements. https://reviewboard.mozilla.org/r/69762/#review79246 Every little bit of documentation helps. ::: xpcom/threads/nsTimerImpl.h:157 (Diff revision 4) > > // The generation number of this timer, re-generated each time the timer is > // initialized so one-shot timers can be canceled and re-initialized by the > // arming thread without any bad race conditions. > + // This is only modified on the target thread, and only after removing the > + // timer from the TimerThread. And on the arming thread initially, right?
Attachment #8778456 -
Flags: review?(nfroyd) → review+
Assignee | ||
Comment 58•8 years ago
|
||
mozreview-review |
Comment on attachment 8778455 [details] Bug 1157323 - Part 6: A minor optimization to Cancel(). https://reviewboard.mozilla.org/r/69760/#review79248 ::: xpcom/threads/nsTimerImpl.cpp:315 (Diff revision 4) > + // mCallbackType is never written by the TimerThread. We assume everyone > + // else plays nice wrt threadsafety. > + if (mCallbackType == CallbackType::Unknown) { I'll see if tests pass without this.
Reporter | ||
Comment 59•8 years ago
|
||
mozreview-review |
Comment on attachment 8778453 [details] Bug 1157323 - Part 4: Stop modifying mTimeout/mDelay from the TimerThread, plus some simplification. https://reviewboard.mozilla.org/r/69756/#review79250 The changes around callback refcounting and timer firing make me a little nervous, but I think they're totally equivalent to what we had before. Please do look into the perma-leaks in comment 54.
Attachment #8778453 -
Flags: review?(nfroyd) → review+
Assignee | ||
Comment 60•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=170baf199176
Assignee | ||
Comment 61•8 years ago
|
||
That try push is way out of date, the most recent ones are on reviewboard.
Reporter | ||
Comment 62•8 years ago
|
||
(In reply to Byron Campen [:bwc] from comment #61) > That try push is way out of date, the most recent ones are on reviewboard. Sorry, you told me that already and I failed to look at the most recent ones...again.
Assignee | ||
Comment 63•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=52c48efc174b
Assignee | ||
Comment 64•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=67755e60bab9
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Updated•8 years ago
|
Attachment #8778455 -
Attachment is obsolete: true
Comment hidden (mozreview-request) |
Comment 72•8 years ago
|
||
Pushed by bcampen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/a718fc141d0a Part 1: Make TestTimers a unit-test. r=froydnj https://hg.mozilla.org/integration/autoland/rev/f814d7c18854 Part 2: Factor nsTimerImpl into two classes, so we don't need to do racy stuff in nsTimerImpl::Release. r=froydnj https://hg.mozilla.org/integration/autoland/rev/85739737d40d Part 3: Do not allow mTimeout to change while a timer is in the queue. r=froydnj https://hg.mozilla.org/integration/autoland/rev/344fdfed174a Part 4: Stop modifying mTimeout/mDelay from the TimerThread, plus some simplification. r=froydnj https://hg.mozilla.org/integration/autoland/rev/f33fe0269dea Part 5: Remove some unnecessary members. r=froydnj https://hg.mozilla.org/integration/autoland/rev/40bb8f061490 Part 6: Some comment improvements. r=froydnj
Comment 73•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/a718fc141d0a https://hg.mozilla.org/mozilla-central/rev/f814d7c18854 https://hg.mozilla.org/mozilla-central/rev/85739737d40d https://hg.mozilla.org/mozilla-central/rev/344fdfed174a https://hg.mozilla.org/mozilla-central/rev/f33fe0269dea https://hg.mozilla.org/mozilla-central/rev/40bb8f061490
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox52:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
You need to log in
before you can comment on or make changes to this bug.
Description
•