Closed
Bug 1141565
Opened 10 years ago
Closed 9 years ago
TSan: data race ipc/chromium/src/base/histogram.cc:730 Accumulate
Categories
(Toolkit :: Telemetry, defect)
Tracking
()
RESOLVED
FIXED
mozilla46
Tracking | Status | |
---|---|---|
firefox46 | --- | fixed |
People
(Reporter: froydnj, Assigned: jseward, NeedInfo)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [tsan][gfx-noted])
Attachments
(2 files, 3 obsolete files)
15.58 KB,
text/plain
|
Details | |
29.41 KB,
patch
|
gfritzsche
:
review+
|
Details | Diff | Splinter Review |
The attached logfile shows a thread/data race detected by TSan (ThreadSanitizer).
* Specific information about this bug
This is a race that we have always known was a possibility, but now TSan is complaining about it: writing to telemetry histograms from multiple threads.
I'm filing this under Imagelib, rather than IPC or even Toolkit::Telemetry, since Imagelib seems like the heaviest user of cross-thread histograms. (Network cache would be another, but I haven't seen cross-thread races there in TSan runs, so perhaps there's no cross-thread access to individual histograms...?) Maybe it's possible to move the histogram accesses all to the main thread?
* General information about TSan, data races, etc.
Typically, races reported by TSan are not false positives, but it is possible that the race is benign. Even in this case though, we should try to come up with a fix unless this would cause unacceptable performance issues. Also note that seemingly benign races can possibly be harmful (also depending on the compiler and the architecture) [1][2].
If the bug cannot be fixed, then this bug should be used to either make a compile-time annotation for blacklisting or add an entry to the runtime blacklist.
[1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[2] _How to miscompile programs with "benign" data races_: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Flags: needinfo?(seth)
Updated•10 years ago
|
Whiteboard: [tsan] → [tsan][gfx-noted]
Comment 1•10 years ago
|
||
Do we spew so much Telemetry data that it'd be a problem to just take a lock or use an atomic on the Telemetry side? ImageLib basically always needs to touch this stuff off-main-thread.
Flags: needinfo?(vdjeric)
Comment 2•10 years ago
|
||
(In reply to Seth Fowler [:seth] from comment #1)
> Do we spew so much Telemetry data that it'd be a problem to just take a lock
> or use an atomic on the Telemetry side? ImageLib basically always needs to
> touch this stuff off-main-thread.
There are tight loops & other hot code that calls Telemetry::Accumulate. We could look into using atomics in Telemetry, and I suspect the overhead of atomics would be acceptable, but we'd have to figure out if any histograms are negatively affected by the extra delay. I don't think it's going to be a quick fix.
For any given histogram, the samples should be accumulated on a single thread, it doesn't have to be the main thread. Does ImageLib run on more than two threads in a session?
Flags: needinfo?(vdjeric)
Comment 3•10 years ago
|
||
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #2)
> For any given histogram, the samples should be accumulated on a single
> thread, it doesn't have to be the main thread. Does ImageLib run on more
> than two threads in a session?
Absolutely. Image decoding happens on a thread pool; several threads can be doing it at once.
If a quick fix isn't possible on the Telemetry side we can accumulate our Telemetry and report it later on the main thread. This isn't necessarily a huge deal. I'm just trying to avoid extra synchronization with the main thread where possible in ImageLib.
Flags: needinfo?(seth)
Updated•10 years ago
|
Flags: needinfo?(seth)
Comment 4•10 years ago
|
||
Nathan, is this the same race condition as in bug 1142079?
Flags: needinfo?(nfroyd)
Reporter | ||
Comment 5•10 years ago
|
||
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #4)
> Nathan, is this the same race condition as in bug 1142079?
I don't see that bug 1142079 is necessarily a race condition. Asking questions in the other bug to clarify.
Flags: needinfo?(nfroyd)
Assignee | ||
Comment 6•9 years ago
|
||
Races in histogram.cc are now amongst the most common races reported
during xpcshell tests of m-c. As of yesterday they comprise 423 reported
race summaries out of a total of 3536, that is, 12% of the total.
Assignee | ||
Comment 7•9 years ago
|
||
Easy STR: Apply test infrastructure patch from bug 1222043.
Then:
TSAN_OPTIONS=suppressions=/home/sewardj/MOZ/SUPPS/tsan-empty.txt \
DISPLAY=:1.0 ./mach xpcshell-test --sequential --log-mach - \
browser/components/places/tests/unit/test_clearHistory_shutdown.js
Assignee | ||
Comment 8•9 years ago
|
||
> (In reply to Seth Fowler [:seth] from comment #1)
> > Do we spew so much Telemetry data that it'd be a problem to just take a lock
> > or use an atomic on the Telemetry side? ImageLib basically always needs to
> > touch this stuff off-main-thread.
>
> There are tight loops & other hot code that calls Telemetry::Accumulate. [..]
I'd like to get some actual numbers on this so we can estimate how bad
the damage might be. I'll see if I can hack up a simple patch that makes
the counters atomic and/or adds more locking, and we can then measure
how many extra global bus operations we get.
Vlad, can you suggest a test that causes these tight loops to run?
Obviously, I want to measure the worst case if possible.
Flags: needinfo?(vladan.bugzilla)
Comment 9•9 years ago
|
||
I've paged a lot of this discussion out. Is there a security risk to not synchronizing on these accumulates? Telemetry was never supposed to deliver 100% accurate measurements since the data gets used in aggregate. Losing a few samples in the histograms is ok.
Two *potential* examples of frequent calls to Accumulate from bug 837271:
* WORD_CACHE_MISSES_CONTENT and WORD_CACHE_MISSES_CHROME in gfx/thebes/gfxFont.cpp
* GRADIENT_DURATION in layout/base/nsCSSRendering.cpp
I think you would be better off coming up with a synthetic test. I could also run an analysis (next week) to find the histograms with the most samples per session.
Flags: needinfo?(vladan.bugzilla)
Assignee | ||
Comment 10•9 years ago
|
||
(In reply to Vladan Djeric (:vladan) -- please needinfo! PTO Nov 16/19/20 from comment #9)
> Is there a security risk to not synchronizing on these accumulates?
I don't know, but I think it's unlikely.
> Telemetry was never supposed to deliver 100% accurate measurements since the
> data gets used in aggregate. Losing a few samples in the histograms is ok.
This is part of a wider effort to get rid of all detectable data races in Gecko.
Even if this one doesn't seem so critical, it is pretty common (see comment 6)
and fixing it makes it easier to spot other races during testing.
I put a counter in Histogram::SampleSet::Accumulate to see how much it is
called. Roughly:
- 270 times browser start to idle, blank tab
- 530 times loading news.bbc.co.uk
- 1700 times loading techcrunch.com
At that frequency, acquiring and releasing a mutex inside Histogram::SampleSet::Accumulate
seems pretty harmless. At best that would add two global bus events per call. That's
two to three orders of magnitude less than the global bus traffic caused by malloc and
free alone, so I'd be surprised if the effect of the extra locking is even measurable.
Assignee | ||
Comment 11•9 years ago
|
||
WIP Patch. Adds a mutex to class Histogram::SampleSet and uses it to
protect all methods callable from outside the class. Temporarily
renames them with a trailing "X" to denote "externally callable".
Adds a few non-locking versions of the "X" methods, that can be used
for internal calls inside the class. These have an "I" suffix. These
are required so as to avoid deadlocking, since it's unsafe for an "X"
method to call another "X" method, as that would result in taking the
mutex twice.
This doesn't just fix the atomicity update problem on the individual
fields in SampleSet, for example on fields counts_, sum_, etc. It
allows locking at the method level, so that related fields can be
updated consistently. It's also potentially cheaper than making the
individual fields atomic, since a multi-field update operation
can be done with a single mutex acquire and release.
There's still some potential deadlock situations I need to take care
of.
Removes unused method Histogram::SampleSet::Subtract.
Patch needs to be cleaned up, the X and I suffixes removed, commented,
etc.
According to TSan, this patch removes all Histogram-related races in
the xpcshell tests.
Assignee | ||
Comment 12•9 years ago
|
||
Attachment #8691081 -
Attachment is obsolete: true
Assignee | ||
Comment 13•9 years ago
|
||
Cleaned up patch. One additional comment: it causes loss of const-ness in
various places, because Histogram::SampleSet acquires a Mutex field, and
that needs to be modified (locked) even when reading from the object. Hence
some Histogram::SampleSet methods and a few Histogram methods lose const-ness.
Attachment #8692882 -
Attachment is obsolete: true
Assignee | ||
Updated•9 years ago
|
Attachment #8692882 -
Attachment description: Believed to be functionally, but needs tidying up. → Believed to be functionally complete, but needs tidying up.
Assignee | ||
Updated•9 years ago
|
Attachment #8692930 -
Flags: review?(gfritzsche)
Comment 14•9 years ago
|
||
Comment on attachment 8692930 [details] [diff] [review]
Cleaned up patch
Review of attachment 8692930 [details] [diff] [review]:
-----------------------------------------------------------------
Do we have any performance tests (Talos, ...) that cover this and could be used to verify the performance impact?
::: ipc/chromium/src/base/histogram.h
@@ +340,5 @@
> + // to provide evidence that the object is locked, by supplying a
> + // const MutexAutoLock& parameter. The parameter is ignored but
> + // must be present. The non-thread-safe methods names all finish
> + // with "E" to indicate that evidence of locking must be supplied,
> + // and to make them easy to identify when reading code.
Is the "E" suffix a standard somewhere in our code?
I personally don't find it very descriptive.
Assignee | ||
Comment 15•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #14)
> Do we have any performance tests (Talos, ...) that cover this and could be
> used to verify the performance impact?
I don't know. But see also, comments 8, 9 and 10 above.
> Is the "E" suffix a standard somewhere in our code?
> I personally don't find it very descriptive.
I don't think it is standard. I intended it to mean "Evidence" (that
is, you must supply evidence that you hold the object's lock when calling).
Suggestions for a better name / scheme welcome.
Comment 16•9 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #15)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #14)
> > Do we have any performance tests (Talos, ...) that cover this and could be
> > used to verify the performance impact?
>
> I don't know. But see also, comments 8, 9 and 10 above.
I saw those comments, but i'd like us to be sure about the performance impact.
Avih, do you have good suggestions on automated or manual testing scenarios that are somewhat representative of common browsing sessions?
Flags: needinfo?(avihpit)
Comment 17•9 years ago
|
||
[Note, I didn't read all comments since I'm in a bit of a hurry now, but I still wanted to reply.]
Nothing comes to mind, other than existing tests which "exercise" the browser in general, but I'm not even sure that telemetry is enabled in talos in order to test the impact of various approaches, and even if it is, it might not be exercised enough to have a measurable impact - which might actually indicate that overall it's not meaningful.
If we want measure the performance of something concrete, we should probably write a test to measure exactly that and only that (as much as possible).
Flags: needinfo?(avihpit)
Updated•9 years ago
|
Component: ImageLib → Telemetry
Product: Core → Toolkit
Updated•9 years ago
|
Assignee: nobody → jseward
Comment 18•9 years ago
|
||
(In reply to Avi Halachmi (:avih) from comment #17)
> If we want measure the performance of something concrete, we should probably
> write a test to measure exactly that and only that (as much as possible).
The problem here is that we are checking for potential performance regressions without knowing concrete scenarios now.
Could we just run this through repeated Talos runs and check whether it triggers any regressions there?
Flags: needinfo?(avihpit)
Comment 19•9 years ago
|
||
Comment on attachment 8692930 [details] [diff] [review]
Cleaned up patch
Review of attachment 8692930 [details] [diff] [review]:
-----------------------------------------------------------------
If we can figure out how to make sure that we don't regress performance, this looks like an ok approach to me.
Style-wise:
(1) We could avoid dropping the constness if we make the mutex field mutable (unless we have some hard rule against that in our code-base?)
(2) Instead of requiring to pass the mutex to each protected call, we could instead factor the protected calls in a sub-class to do something more RAII oriented.
E.g.:
ProtectedSampleSet protected = sampleset.GetProtectedSampleSet();
foo(protected.log_sum(), ...);
Or:
Histogram::Inconsistencies check = histogram->FindCorruption(ss.GetProtectedSampleSet());
Or does that look like over-design?
::: ipc/chromium/src/base/histogram.cc
@@ -774,5 @@
> for (size_t index = 0; index < counts_.size(); ++index)
> counts_[index] += other.counts_[index];
> }
>
> -void Histogram::SampleSet::Subtract(const SampleSet& other) {
This was entirely unused?
Attachment #8692930 -
Flags: review?(gfritzsche)
Comment 20•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] [away Dec 19 - Jan 3] from comment #18)
> (In reply to Avi Halachmi (:avih) from comment #17)
> > If we want measure the performance of something concrete, we should probably
> > write a test to measure exactly that and only that (as much as possible).
>
> The problem here is that we are checking for potential performance
> regressions without knowing concrete scenarios now.
> Could we just run this through repeated Talos runs and check whether it
> triggers any regressions there?
Sure, push to try and test all platforms with 5 retriggers* with and without the questionable patch, then compare the talos results between these revisions using https://treeherder.mozilla.org/perf.html#/comparechooser
[*] try syntax: "try: -b do -p all -u all -t all --rebuild-talos 5"
Flags: needinfo?(avihpit)
Assignee | ||
Comment 21•9 years ago
|
||
> > The problem here is that we are checking for potential performance
> > regressions without knowing concrete scenarios now.
> > Could we just run this through repeated Talos runs and check whether it
> > triggers any regressions there?
>
> Sure, push to try and test all platforms with 5 retriggers* with and without
> the questionable patch, then compare the talos results between these
> revisions using https://treeherder.mozilla.org/perf.html#/comparechooser
Results at
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=ede1fdfd214f&newProject=try&newRevision=9822c212f754&framework=1
There seem to be both regressions and speedups. What is the significance
of this?
Flags: needinfo?(avihpit)
Comment 22•9 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #21)
> https://treeherder.mozilla.org/perf.html#/
> compare?originalProject=try&originalRevision=ede1fdfd214f&newProject=try&newR
> evision=9822c212f754&framework=1
>
> There seem to be both regressions and speedups. What is the significance
> of this?
The regressions are:
2% for dromaeo-css opt win7-32
2% for svgr-opacity e10s opt win8-64
These seem relatively minor to me, and the svgr one especially seems to me like it shouldn't be affected at all (I don't think it has many telemetry accumulation, if at all).
Overall, I'd attribute those changes to noise and as such - not significant, unless you have an explanation for the improvements/regressions and were expecting at least some of them. I guess you weren't though?
Flags: needinfo?(avihpit)
Assignee | ||
Comment 23•9 years ago
|
||
(In reply to Avi Halachmi (:avih) from comment #22)
> [..] and were
> expecting at least some of them. I guess you weren't though?
No, I wasn't expecting any change.
Assignee | ||
Comment 24•9 years ago
|
||
Another performance data point:
Running the scroll test from bug 894128 with layout.frame_rate=0
(maximum possible scrolling rate), using planet.mozilla.org, 3 times
each with/without the patch, produces the following:
In all cases: Window size: 822 x 584
Without patch: Avg Interval: 1.44 1.48 1.48 (ms)
StdDev: 0.40 0.38 0.38 (ms)
With patch: Avg Interval: 1.49 1.39 1.43 (ms)
StdDev: 0.40 0.32 0.36 (ms)
It's not obviously slower -- in fact faster, on average. So I conclude
the difference is unmeasurable at least with this test case.
Comment 25•9 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #24)
> Another performance data point:
>
> Running the scroll test...
>
> It's not obviously slower -- in fact faster, on average. So I conclude
> the difference is unmeasurable at least with this test case.
Thanks. Just adding why this result is meaningful: AFAIK, the most frequent telemetry accumulations (and therefore exercising the code at this bug) is during synchronous animation which said test case does.
Therefore it means there's no measurable regression at our worst case scenario with the new code.
Assignee | ||
Comment 26•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] [away Dec 19 - Jan 3] from comment #19)
> Comment on attachment 8692930 [details] [diff] [review]
> Style-wise:
> (1) We could avoid dropping the constness if we make the mutex field mutable
> (unless we have some hard rule against that in our code-base?)
Done. I think that restores the constness completely.
> (2) Instead of requiring to pass the mutex to each protected call, we could
> instead factor the protected calls in a sub-class to do something more RAII
> oriented.
Hmm, this seems overkill to me. Ok to skip on this?
> > -void Histogram::SampleSet::Subtract(const SampleSet& other) {
>
> This was entirely unused?
As far as I can see, yes, entirely unused.
Assignee | ||
Comment 27•9 years ago
|
||
Attachment #8692930 -
Attachment is obsolete: true
Assignee | ||
Updated•9 years ago
|
Attachment #8702156 -
Flags: review?(gfritzsche)
Comment 28•9 years ago
|
||
(In reply to Avi Halachmi (:avih) from comment #25)
> (In reply to Julian Seward [:jseward] from comment #24)
> > Another performance data point:
> >
> > Running the scroll test...
> >
> > It's not obviously slower -- in fact faster, on average. So I conclude
> > the difference is unmeasurable at least with this test case.
>
> Thanks. Just adding why this result is meaningful: AFAIK, the most frequent
> telemetry accumulations (and therefore exercising the code at this bug) is
> during synchronous animation which said test case does.
>
> Therefore it means there's no measurable regression at our worst case
> scenario with the new code.
Great, thanks for figuring this out.
Comment 29•9 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #26)
> > (2) Instead of requiring to pass the mutex to each protected call, we could
> > instead factor the protected calls in a sub-class to do something more RAII
> > oriented.
>
> Hmm, this seems overkill to me. Ok to skip on this?
Yes - this should be internal to Telemetry.cpp only, so no need for this.
Comment 30•9 years ago
|
||
Comment on attachment 8702156 [details] [diff] [review]
Rebased, and with with fixes for review comments per comment 26
Review of attachment 8702156 [details] [diff] [review]:
-----------------------------------------------------------------
::: ipc/chromium/src/base/histogram.h
@@ +340,5 @@
> + // to provide evidence that the object is locked, by supplying a
> + // const MutexAutoLock& parameter. The parameter is ignored but
> + // must be present. The non-thread-safe methods names all finish
> + // with "E" to indicate that evidence of locking must be supplied,
> + // and to make them easy to identify when reading code.
I still think the "E" suffix is not very descriptive.
We can actually just drop the suffix IMO, the additional lock argument should be sufficient.
@@ +366,5 @@
> + //
> + // The caller must hold |this.mutex_|, and must supply evidence by
> + // passing a const reference to the relevant MutexAutoLock used.
> +
> + Count countsE(const MutexAutoLock& ev, size_t i) const {
Nit: Trailing whitespace.
Attachment #8702156 -
Flags: review?(gfritzsche) → feedback+
Comment 31•9 years ago
|
||
Comment on attachment 8702156 [details] [diff] [review]
Rebased, and with with fixes for review comments per comment 26
r=me with comment 30 addressed.
Attachment #8702156 -
Flags: feedback+ → review+
Comment 32•9 years ago
|
||
Comment 33•9 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox46:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
You need to log in
before you can comment on or make changes to this bug.
Description
•