Closed Bug 1141565 Opened 5 years ago Closed 4 years ago

TSan: data race ipc/chromium/src/base/histogram.cc:730 Accumulate

Categories

(Toolkit :: Telemetry, defect)

x86_64
Linux
defect
Not set

Tracking


RESOLVED FIXED
mozilla46
Tracking Status
firefox46 --- fixed

People

(Reporter: froydnj, Assigned: jseward, NeedInfo)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [tsan][gfx-noted])

Attachments

(2 files, 3 obsolete files)

The attached logfile shows a thread/data race detected by TSan (ThreadSanitizer).

* Specific information about this bug

This is a race that we have always known was a possibility, but now TSan is complaining about it: writing to telemetry histograms from multiple threads.

I'm filing this under ImageLib, rather than IPC or even Toolkit::Telemetry, since ImageLib seems like the heaviest user of cross-thread histograms.  (The network cache would be another, but I haven't seen cross-thread races there in TSan runs, so perhaps there's no cross-thread access to individual histograms...?)  Maybe it's possible to move all the histogram accesses to the main thread?

* General information about TSan, data races, etc.

Typically, races reported by TSan are not false positives, but it is possible that the race is benign. Even so, we should try to come up with a fix unless doing so would cause unacceptable performance issues. Also note that seemingly benign races can still be harmful, depending on the compiler and the architecture [1][2].

If the bug cannot be fixed, then this bug should be used to either make a compile-time annotation for blacklisting or add an entry to the runtime blacklist.

[1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[2] _How to miscompile programs with "benign" data races_: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Flags: needinfo?(seth)
Whiteboard: [tsan] → [tsan][gfx-noted]
Do we spew so much Telemetry data that it'd be a problem to just take a lock or use an atomic on the Telemetry side? ImageLib basically always needs to touch this stuff off-main-thread.
Flags: needinfo?(vdjeric)
(In reply to Seth Fowler [:seth] from comment #1)
> Do we spew so much Telemetry data that it'd be a problem to just take a lock
> or use an atomic on the Telemetry side? ImageLib basically always needs to
> touch this stuff off-main-thread.

There are tight loops and other hot code paths that call Telemetry::Accumulate. We could look into using atomics in Telemetry, and I suspect the overhead of atomics would be acceptable, but we'd have to figure out whether any histograms are negatively affected by the extra delay. I don't think it's going to be a quick fix.

For any given histogram, the samples should be accumulated on a single thread; it doesn't have to be the main thread. Does ImageLib run on more than two threads in a session?
Flags: needinfo?(vdjeric)
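To make the atomics option concrete, here is a minimal sketch of what it could look like. AtomicSampleSet and its methods are illustrative stand-ins invented for this sketch, not the actual Histogram::SampleSet API:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Illustrative stand-in for Histogram::SampleSet, showing per-bucket
// counts as atomics so concurrent Accumulate() calls don't race.
class AtomicSampleSet {
 public:
  explicit AtomicSampleSet(size_t buckets) : counts_(buckets) {}

  void Accumulate(size_t bucket, int64_t value) {
    // Relaxed ordering is enough here: each bump only needs to be
    // indivisible, not ordered against unrelated memory traffic.
    counts_[bucket].fetch_add(1, std::memory_order_relaxed);
    sum_.fetch_add(value, std::memory_order_relaxed);
  }

  int64_t count(size_t bucket) const { return counts_[bucket].load(); }
  int64_t sum() const { return sum_.load(); }

 private:
  std::vector<std::atomic<int64_t>> counts_;
  std::atomic<int64_t> sum_{0};
};
```

Note the caveat with this approach: each field is race-free on its own, but counts_ and sum_ are not updated as a unit, so a concurrent reader can still observe them out of sync.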
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #2)
> For any given histogram, the samples should be accumulated on a single
> thread, it doesn't have to be the main thread. Does ImageLib run on more
> than two threads in a session?

Absolutely. Image decoding happens on a thread pool; several threads can be doing it at once.

If a quick fix isn't possible on the Telemetry side we can accumulate our Telemetry and report it later on the main thread. This isn't necessarily a huge deal. I'm just trying to avoid extra synchronization with the main thread where possible in ImageLib.
Flags: needinfo?(seth)
Flags: needinfo?(seth)
Nathan, is this the same race condition as in bug 1142079?
Flags: needinfo?(nfroyd)
See Also: → 1142079
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #4)
> Nathan, is this the same race condition as in bug 1142079?

I don't see that bug 1142079 is necessarily a race condition.  Asking questions in the other bug to clarify.
Flags: needinfo?(nfroyd)
Races in histogram.cc are now amongst the most common races reported during xpcshell tests of m-c.  As of yesterday they comprise 423 reported race summaries out of a total of 3536, that is, 12% of the total.
Easy STR: Apply test infrastructure patch from bug 1222043.
Then:

TSAN_OPTIONS=suppressions=/home/sewardj/MOZ/SUPPS/tsan-empty.txt \
  DISPLAY=:1.0 ./mach xpcshell-test --sequential --log-mach - \
  browser/components/places/tests/unit/test_clearHistory_shutdown.js
> (In reply to Seth Fowler [:seth] from comment #1)
> > Do we spew so much Telemetry data that it'd be a problem to just take a lock
> > or use an atomic on the Telemetry side? ImageLib basically always needs to
> > touch this stuff off-main-thread.
> 
> There are tight loops & other hot code that calls Telemetry::Accumulate. [..]

I'd like to get some actual numbers on this so we can estimate how bad
the damage might be.  I'll see if I can hack up a simple patch that makes
the counters atomic and/or adds more locking, and we can then measure
how many extra global bus operations we get.

Vlad, can you suggest a test that causes these tight loops to run?
Obviously, I want to measure the worst case if possible.
Flags: needinfo?(vladan.bugzilla)
I've paged a lot of this discussion out. Is there a security risk to not synchronizing on these accumulates? Telemetry was never supposed to deliver 100% accurate measurements since the data gets used in aggregate. Losing a few samples in the histograms is ok.

Two *potential* examples of frequent calls to Accumulate from bug 837271:
* WORD_CACHE_MISSES_CONTENT and WORD_CACHE_MISSES_CHROME in gfx/thebes/gfxFont.cpp
* GRADIENT_DURATION in layout/base/nsCSSRendering.cpp

I think you would be better off coming up with a synthetic test. I could also run an analysis (next week) to find the histograms with the most samples per session.
Flags: needinfo?(vladan.bugzilla)
(In reply to Vladan Djeric (:vladan) -- please needinfo! PTO Nov 16/19/20 from comment #9)
> Is there a security risk to not synchronizing on these accumulates?

I don't know, but I think it's unlikely.

> Telemetry was never supposed to deliver 100% accurate measurements since the
> data gets used in aggregate. Losing a few samples in the histograms is ok.

This is part of a wider effort to get rid of all detectable data races in Gecko.
Even if this one doesn't seem so critical, it is pretty common (see comment 6)
and fixing it makes it easier to spot other races during testing.

I put a counter in Histogram::SampleSet::Accumulate to see how often it is
called.  Roughly:

- 270 times browser start to idle, blank tab
- 530 times loading news.bbc.co.uk 
- 1700 times loading techcrunch.com

At that frequency, acquiring and releasing a mutex inside Histogram::SampleSet::Accumulate
seems pretty harmless.  At most that would add two global bus events per call.  That's
two to three orders of magnitude less than the global bus traffic caused by malloc and
free alone, so I'd be surprised if the effect of the extra locking were even measurable.
Attached patch WIP patch. (obsolete) — Splinter Review
WIP Patch.  Adds a mutex to class Histogram::SampleSet and uses it to
protect all methods callable from outside the class.  Temporarily
renames them with a trailing "X" to denote "externally callable".

Adds a few non-locking versions of the "X" methods that can be used
for internal calls inside the class.  These have an "I" suffix.  They
are required to avoid deadlocking: it's unsafe for an "X" method to
call another "X" method, since that would take the mutex twice.

This doesn't just fix the atomic-update problem on the individual
fields in SampleSet, for example counts_ and sum_.  It also allows
locking at the method level, so that related fields can be updated
consistently.  And it's potentially cheaper than making the individual
fields atomic, since a multi-field update can be done with a single
mutex acquire and release.

There are still some potential deadlock situations I need to take care
of.

Removes unused method Histogram::SampleSet::Subtract.

Patch needs to be cleaned up, the X and I suffixes removed, commented,
etc.

According to TSan, this patch removes all Histogram-related races in
the xpcshell tests.
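In outline, the pattern described above looks like the following sketch, with std::mutex / std::lock_guard standing in for Mozilla's Mutex / MutexAutoLock, and with invented field and method names:

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>

// Illustrative stand-in for the patched Histogram::SampleSet (names
// invented).  Public "X" methods take the mutex; private "I" variants
// assume the caller already holds it, so an "X" method never calls
// another "X" method and deadlocks on the non-reentrant mutex.
class SampleSet {
 public:
  void AccumulateX(int64_t value) {
    std::lock_guard<std::mutex> lock(mutex_);
    AccumulateI(value);
  }

  int64_t SumX() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return sum_;
  }

 private:
  void AccumulateI(int64_t value) {
    // Both fields change under one lock, so readers never observe
    // sum_ and count_ out of sync -- something per-field atomics
    // alone could not guarantee.
    sum_ += value;
    ++count_;
  }

  // mutable so that const readers like SumX() can still lock.
  mutable std::mutex mutex_;
  int64_t sum_ = 0;
  int64_t count_ = 0;
};
```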
Attachment #8691081 - Attachment is obsolete: true
Attached patch Cleaned up patch (obsolete) — Splinter Review
Cleaned up patch.  One additional comment: it causes loss of const-ness in
various places, because Histogram::SampleSet gains a Mutex field, and
that needs to be modified (locked) even when reading from the object.  Hence
some Histogram::SampleSet methods and a few Histogram methods lose const-ness.
Attachment #8692882 - Attachment is obsolete: true
Attachment #8692882 - Attachment description: Believed to be functionally, but needs tidying up. → Believed to be functionally complete, but needs tidying up.
Attachment #8692930 - Flags: review?(gfritzsche)
Comment on attachment 8692930 [details] [diff] [review]
Cleaned up patch

Review of attachment 8692930 [details] [diff] [review]:
-----------------------------------------------------------------

Do we have any performance tests (Talos, ...) that cover this and could be used to verify the performance impact?

::: ipc/chromium/src/base/histogram.h
@@ +340,5 @@
> +    // to provide evidence that the object is locked, by supplying a
> +    // const MutexAutoLock& parameter.  The parameter is ignored but
> +    // must be present.  The non-thread-safe method names all end
> +    // with "E" to indicate that evidence of locking must be supplied,
> +    // and to make them easy to identify when reading code.

Is the "E" suffix a standard somewhere in our code?
I personally don't find it very descriptive.
(In reply to Georg Fritzsche [:gfritzsche] from comment #14)
> Do we have any performance tests (Talos, ...) that cover this and could be
> used to verify the performance impact?

I don't know.  But see also, comments 8, 9 and 10 above.

> Is the "E" suffix a standard somewhere in our code?
> I personally don't find it very descriptive.

I don't think it is standard.  I intended it to mean "Evidence" (that
is, you must supply evidence that you hold the object's lock when calling).
Suggestions for a better name / scheme welcome.
(In reply to Julian Seward [:jseward] from comment #15)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #14)
> > Do we have any performance tests (Talos, ...) that cover this and could be
> > used to verify the performance impact?
> 
> I don't know.  But see also, comments 8, 9 and 10 above.

I saw those comments, but i'd like us to be sure about the performance impact.
Avih, do you have good suggestions on automated or manual testing scenarios that are somewhat representative of common browsing sessions?
Flags: needinfo?(avihpit)
[Note, I didn't read all comments since I'm in a bit of a hurry now, but I still wanted to reply.]

Nothing comes to mind, other than existing tests which "exercise" the browser in general. I'm not even sure that telemetry is enabled in talos, so it may not be possible to test the impact of various approaches there; and even if it is, telemetry might not be exercised enough to have a measurable impact - which might actually indicate that overall it's not meaningful.

If we want to measure the performance of something concrete, we should probably write a test to measure exactly that and only that (as much as possible).
Flags: needinfo?(avihpit)
Component: ImageLib → Telemetry
Product: Core → Toolkit
Assignee: nobody → jseward
(In reply to Avi Halachmi (:avih) from comment #17)
> If we want to measure the performance of something concrete, we should probably
> write a test to measure exactly that and only that (as much as possible).

The problem here is that we are checking for potential performance regressions without knowing concrete scenarios now.
Could we just run this through repeated Talos runs and check whether it triggers any regressions there?
Flags: needinfo?(avihpit)
Comment on attachment 8692930 [details] [diff] [review]
Cleaned up patch

Review of attachment 8692930 [details] [diff] [review]:
-----------------------------------------------------------------

If we can figure out how to make sure that we don't regress performance, this looks like an ok approach to me.

Style-wise:
(1) We could avoid dropping the constness if we make the mutex field mutable (unless we have some hard rule against that in our code-base?)
(2) Instead of requiring to pass the mutex to each protected call, we could instead factor the protected calls in a sub-class to do something more RAII oriented.
E.g.:
  ProtectedSampleSet ps = sampleset.GetProtectedSampleSet();
  foo(ps.log_sum(), ...);
Or:
  Histogram::Inconsistencies check = histogram->FindCorruption(ss.GetProtectedSampleSet());
Or does that look like over-design?

::: ipc/chromium/src/base/histogram.cc
@@ -774,5 @@
>    for (size_t index = 0; index < counts_.size(); ++index)
>      counts_[index] += other.counts_[index];
>  }
>  
> -void Histogram::SampleSet::Subtract(const SampleSet& other) {

This was entirely unused?
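For reference, the RAII accessor idea from point (2) above could be sketched roughly as follows; ProtectedSampleSet and GetProtectedSampleSet are hypothetical names from this comment, not code in the tree, and std::mutex stands in for Mozilla's Mutex:

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical sketch of the RAII accessor idea.  The accessor holds
// the lock for its whole lifetime, so everything called through it is
// implicitly protected and no per-call lock argument is needed.
class SampleSet {
 private:
  std::mutex mutex_;
  int64_t sum_ = 0;

 public:
  class ProtectedSampleSet {
   public:
    explicit ProtectedSampleSet(SampleSet& ss)
        : lock_(ss.mutex_), ss_(ss) {}

    int64_t sum() const { return ss_.sum_; }
    void Accumulate(int64_t v) { ss_.sum_ += v; }

   private:
    std::unique_lock<std::mutex> lock_;  // released when this dies
    SampleSet& ss_;
  };

  ProtectedSampleSet GetProtectedSampleSet() {
    return ProtectedSampleSet(*this);
  }
};
```

The accessor's scope then bounds the critical section, e.g. `auto ps = ss.GetProtectedSampleSet(); ps.Accumulate(5);`. One downside is that the lock is held for the accessor's whole lifetime, so the critical section is only as tight as the caller's scoping discipline.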
Attachment #8692930 - Flags: review?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] [away Dec 19 - Jan 3] from comment #18)
> (In reply to Avi Halachmi (:avih) from comment #17)
> > If we want to measure the performance of something concrete, we should probably
> > write a test to measure exactly that and only that (as much as possible).
> 
> The problem here is that we are checking for potential performance
> regressions without knowing concrete scenarios now.
> Could we just run this through repeated Talos runs and check whether it
> triggers any regressions there?

Sure, push to try and test all platforms with 5 retriggers* with and without the questionable patch, then compare the talos results between these revisions using https://treeherder.mozilla.org/perf.html#/comparechooser

[*] try syntax: "try: -b do -p all -u all -t all --rebuild-talos 5"
Flags: needinfo?(avihpit)
Depends on: 1228147
> > The problem here is that we are checking for potential performance
> > regressions without knowing concrete scenarios now.
> > Could we just run this through repeated Talos runs and check whether it
> > triggers any regressions there?
> 
> Sure, push to try and test all platforms with 5 retriggers* with and without
> the questionable patch, then compare the talos results between these
> revisions using https://treeherder.mozilla.org/perf.html#/comparechooser

Results at

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=ede1fdfd214f&newProject=try&newRevision=9822c212f754&framework=1

There seem to be both regressions and speedups.  What is the significance
of this?
Flags: needinfo?(avihpit)
(In reply to Julian Seward [:jseward] from comment #21)
> https://treeherder.mozilla.org/perf.html#/
> compare?originalProject=try&originalRevision=ede1fdfd214f&newProject=try&newR
> evision=9822c212f754&framework=1
> 
> There seem to be both regressions and speedups.  What is the significance
> of this?

The regressions are:
2% for dromaeo-css opt win7-32
2% for svgr-opacity e10s opt win8-64

These seem relatively minor to me, and the svgr one especially seems like it shouldn't be affected at all (I don't think it does many telemetry accumulations, if any).

Overall, I'd attribute those changes to noise and as such - not significant, unless you have an explanation for the improvements/regressions and were expecting at least some of them. I guess you weren't though?
Flags: needinfo?(avihpit)
(In reply to Avi Halachmi (:avih) from comment #22)
> [..] and were
> expecting at least some of them. I guess you weren't though?

No, I wasn't expecting any change.
Another performance data point:

Running the scroll test from bug 894128 with layout.frame_rate=0
(maximum possible scrolling rate), using planet.mozilla.org, 3 times
each with/without the patch, produces the following:

In all cases: Window size: 822 x 584

Without patch:  Avg Interval: 1.44 1.48 1.48 (ms)
                StdDev:       0.40 0.38 0.38 (ms)

With patch:     Avg Interval: 1.49 1.39 1.43 (ms)
                StdDev:       0.40 0.32 0.36 (ms)

It's not obviously slower -- in fact faster, on average.  So I conclude
the difference is unmeasurable at least with this test case.
(In reply to Julian Seward [:jseward] from comment #24)
> Another performance data point:
> 
> Running the scroll test...
>
> It's not obviously slower -- in fact faster, on average.  So I conclude
> the difference is unmeasurable at least with this test case.

Thanks. Just adding why this result is meaningful: AFAIK, the most frequent telemetry accumulations (and therefore the heaviest exercise of the code in this bug) occur during synchronous animation, which said test case performs.

This means there's no measurable regression in our worst-case scenario with the new code.
(In reply to Georg Fritzsche [:gfritzsche] [away Dec 19 - Jan 3] from comment #19)
> Comment on attachment 8692930 [details] [diff] [review]

> Style-wise:
> (1) We could avoid dropping the constness if we make the mutex field mutable
> (unless we have some hard rule against that in our code-base?)

Done.  I think that restores the constness completely.

> (2) Instead of requiring to pass the mutex to each protected call, we could
> instead factor the protected calls in a sub-class to do something more RAII
> oriented.

Hmm, this seems like overkill to me.  OK to skip this?

> > -void Histogram::SampleSet::Subtract(const SampleSet& other) {
> 
> This was entirely unused?

As far as I can see, yes, entirely unused.
Attachment #8702156 - Flags: review?(gfritzsche)
(In reply to Avi Halachmi (:avih) from comment #25)
> (In reply to Julian Seward [:jseward] from comment #24)
> > Another performance data point:
> > 
> > Running the scroll test...
> >
> > It's not obviously slower -- in fact faster, on average.  So I conclude
> > the difference is unmeasurable at least with this test case.
> 
> Thanks. Just adding why this result is meaningful: AFAIK, the most frequent
> telemetry accumulations (and therefore exercising the code at this bug) is
> during synchronous animation which said test case does.
> 
> Therefore it means there's no measurable regression at our worst case
> scenario with the new code.

Great, thanks for figuring this out.
(In reply to Julian Seward [:jseward] from comment #26)
> > (2) Instead of requiring to pass the mutex to each protected call, we could
> > instead factor the protected calls in a sub-class to do something more RAII
> > oriented.
> 
> Hmm, this seems overkill to me.  Ok to skip on this?

Yes - this should be internal to Telemetry.cpp only, so no need for this.
Comment on attachment 8702156 [details] [diff] [review]
Rebased, and with fixes for review comments per comment 26

Review of attachment 8702156 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/chromium/src/base/histogram.h
@@ +340,5 @@
> +    // to provide evidence that the object is locked, by supplying a
> +    // const MutexAutoLock& parameter.  The parameter is ignored but
> +    // must be present.  The non-thread-safe method names all end
> +    // with "E" to indicate that evidence of locking must be supplied,
> +    // and to make them easy to identify when reading code.

I still think the "E" suffix is not very descriptive.
We can actually just drop the suffix IMO; the additional lock argument should be sufficient.

@@ +366,5 @@
> +    //
> +    // The caller must hold |this.mutex_|, and must supply evidence by
> +    // passing a const reference to the relevant MutexAutoLock used.
> +
> +    Count countsE(const MutexAutoLock& ev, size_t i) const { 

Nit: Trailing whitespace.
Attachment #8702156 - Flags: review?(gfritzsche) → feedback+
Comment on attachment 8702156 [details] [diff] [review]
Rebased, and with fixes for review comments per comment 26

r=me with comment 30 addressed.
Attachment #8702156 - Flags: feedback+ → review+
https://hg.mozilla.org/mozilla-central/rev/38fef54dc04e
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Duplicate of this bug: 1197616