TSan: data race toolkit/components/telemetry/Telemetry.cpp in CanRecordBase

RESOLVED FIXED in Firefox 50

Status


defect
P3
normal
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: jseward, Assigned: jseward)

Tracking

(Blocks 2 bugs)

unspecified
mozilla50
Points:
---

Firefox Tracking Flags

(firefox50 fixed)

Details

(Whiteboard: [measurement:client])

Attachments

(4 attachments, 8 obsolete attachments)

Assignee

Description

3 years ago
Posted file race-CanRecordBase
The attached logfile shows a thread/data race detected by TSan (ThreadSanitizer).

* Specific information about this bug

This is effectively a case of "Publishing objects without synchronization"
as described at
https://web.archive.org/web/20150331075219/https://code.google.com/p/thread-sanitizer/wiki/PopularDataRaces

In this case, the race is on a global variable 

  TelemetryImpl* TelemetryImpl::sTelemetry,

which starts out being NULL.  One thread allocates an object and assigns
it to |sTelemetry|.  A different thread queries the object like this

  return !sTelemetry || sTelemetry->mCanRecordBase;

and there is no synchronisation at all between the two.
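A minimal sketch of the problem and one possible repair, using std::atomic with release/acquire ordering (the actual fix landed in Gecko may differ; all names below are illustrative stand-ins for the real TelemetryImpl machinery):

```cpp
#include <atomic>

// Hypothetical stand-in for TelemetryImpl. Publishing it through a plain
// pointer, as in the bug, is a data race; an atomic pointer gives the
// store and load a defined ordering.
struct TelemetryImpl {
  bool mCanRecordBase = true;
};

static std::atomic<TelemetryImpl*> sTelemetry{nullptr};

// Thread A: publish the singleton. The release store guarantees the
// object's fields are visible before the pointer becomes non-null.
void InitTelemetry() {
  sTelemetry.store(new TelemetryImpl(), std::memory_order_release);
}

// Thread B: query it. The acquire load pairs with the release above.
bool CanRecordBase() {
  TelemetryImpl* t = sTelemetry.load(std::memory_order_acquire);
  return !t || t->mCanRecordBase;
}
```

Without the release/acquire pair, the reading thread could in principle observe a non-null pointer to a not-yet-initialized object.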

* General information about TSan, data races, etc.

Typically, races reported by TSan are not false positives, but it is
possible that the race is benign. Even in this case though, we should
try to come up with a fix unless this would cause unacceptable
performance issues. Also note that seemingly benign races can possibly
be harmful (also depending on the compiler and the architecture)
[1][2].

If the bug cannot be fixed, then this bug should be used to either
make a compile-time annotation for blacklisting or add an entry to the
runtime blacklist.

[1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[2] _How to miscompile programs with "benign" data races_: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Priority: -- → P3
Whiteboard: [measurement:client]
Assignee

Comment 1

3 years ago
The fact that mozilla::Telemetry isn't thread safe accounts for a
whole bunch of different races that TSan reports.  At least the
following:

11 toolkit/components/telemetry/Telemetry.cpp:3322:37 in CanRecordBase
11 toolkit/components/telemetry/Telemetry.cpp:3322:11 in CanRecordBase
 3 /usr/lib/gcc/x86_64-redhat-linux/4.9.2/../../../../include/c++/4.9.2/bits/stl_vector.h:795:32 in operator[]
 3 /usr/lib/gcc/x86_64-redhat-linux/4.9.2/../../../../include/c++/4.9.2/bits/basic_string.h:293:29 in _M_data
 3 /usr/lib/gcc/x86_64-redhat-linux/4.9.2/../../../../include/c++/4.9.2/bits/atomic_base.h:500:9 in load
 3 toolkit/components/telemetry/Telemetry.cpp:1058:18 in GetHistogramByEnumId
 3 on vptr (ctor/dtor vs virtual call) ipc/chromium/src/base/histogram.cc:128:18 in base::Histogram::Add(int)
 3 ipc/chromium/src/base/histogram.cc:392:10 in base::Histogram::bucket_count() const
 3 ipc/chromium/src/base/histogram.cc:388:10 in base::Histogram::ranges(unsigned long) const
Assignee

Comment 2

3 years ago
Posted patch bug1258183-Telemetry-3.cset (obsolete) — Splinter Review
Proof-of-concept patch.  Do not land.

Following irc conversation with gfritzsche yesterday, there are three
sets of functions which make up the external interface to
Telemetry.cpp and where we may want to add locking, so as to de-race
the module:

[1] The functions listed in toolkit/components/telemetry/Telemetry.h

[2] The functions listed in NS_DECL_NSITELEMETRY and
    NS_DECL_NSIMEMORYREPORTER in class TelemetryImpl

[3] Misc functions JSHistogram_* and JSKeyedHistogram_*

Locking is controlled for these, in this patch, by the macros
EXTERNAL_METHOD_H, EXTERNAL_METHOD_NS and EXTERNAL_METHOD_JS
respectively.  Currently locking is enabled only for [1] and [3]
because enabling locking for [2] leads to call chains of the form
Telemetry -> JS-world -> Telemetry (to report some GC statistic) and
hence it deadlocks.

The patch is huge.  Almost all of it is unnecessary: it renames the
external call sites for [1] by adding "X" at the end of their names, so
I could figure out what the external interface really was.  Per irc
chat yesterday this can probably be done better using namespaces, so
all of those ~ 1000 hunks should eventually disappear.

For the Telemetry.cpp changes themselves, the majority is a light
reorganisation of the file, adding comment lines at the start of large
sections to make it easier to navigate.  The actual race-removal
changes are pretty small.
Also, as mentioned on IRC, i think we have two main categories of code in Telemetry.cpp:
(1) Histogram accumulation, control and serialization
(2) "other" Telemetry functionality that was historically accumulated in there

Pending feasibility, i would like us to treat (1) separate from (2) for locking etc.:
* i would assume that most (nearly all?) of the races are in the histogram code
* there are potentially expensive operations in (2) that are independent of the code in (1)
* the code for (2) is clearly unrelated to (1)

If it helps with reasoning etc., we could look into moving (1) out into a contained module with a clear, separate public interface.
I'm not entirely sure about the fallout, but i could try to hack this out and see if it works - if that's helpful.
I am worried about overhead. Some calls (including even some accumulations) are done on extremely hot paths. Any changes will have to measure for performance impact.

I'm not so worried about the original reported race in CanRecordBase. The code, slightly rewritten, is:

if (!sTelemetry) {
  return false;
}
return sTelemetry->mCanRecordBase;

Which looks safe to me.

As much as it pains me to suggest it, maybe we should have ThreadSafeTelemetry and ThreadUnsafeTelemetry APIs both, then accumulations and queries on hot paths with external synchronization can be fast, and JS and junk can access it without external protections?
(In reply to Chris H-C :chutten from comment #4)
> Which looks safe to me.

"Which looks safe to me since no one clears sTelemetry." I meant to write.
(In reply to Chris H-C :chutten from comment #4)
> I am worried about overhead. Some calls (including even some accumulations)
> are done on extremely hot paths. Any changes will have to measure for
> performance impact.

We already have locking around histograms in histogram.h/.cc (after rough performance tests in bug 1141565), although it would be great to have more input on performance validation for this.
If we can isolate the public API for histograms, we can move the locks up to that API level and have basically no change performance-wise.

In general, histogram accumulation should probably not happen in extremely hot paths - in case of hot loops etc. we should ideally accumulate locally and submit later/periodically.
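The "accumulate locally and submit later/periodically" idea can be sketched roughly like this (a hypothetical illustration, not Gecko code; the names and the int-sample storage are assumptions):

```cpp
#include <mutex>
#include <vector>

// Shared telemetry storage, protected by a lock.
static std::mutex gMutex;
static std::vector<int> gSharedSamples;

// A hot loop records into an unsynchronized local buffer and only takes
// the shared lock once per flush, not once per sample.
class LocalAccumulator {
 public:
  void Add(int sample) { mLocal.push_back(sample); }  // hot path: no lock

  void Flush() {                                      // cold path: one lock
    std::lock_guard<std::mutex> lock(gMutex);
    gSharedSamples.insert(gSharedSamples.end(), mLocal.begin(), mLocal.end());
    mLocal.clear();
  }

 private:
  std::vector<int> mLocal;
};
```

The trade-off is latency: samples only become globally visible at flush time, so the flush cadence has to match how fresh the data needs to be.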
Assignee

Comment 7

3 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #3)
> Also, as mentioned on IRC, i think we have two main categories of code in
> Telemetry.cpp:
> (1) Histogram accumulation, control and serialization
> (2) "other" Telemetry functionality that was historically accumulated in
> there
> [..]
> If it helps with reasoning etc., we could look into moving (1) out into a
> contained module with a clear, separate public interface.
> I'm not entirely sure about the fallout, but i could try to hack this out
> and see if it works - if that's helpful.

I think that would definitely be a helpful thing to do.  Telemetry.cpp is 
big and complex and intertwined, and your proposal will help untangle it.
Assignee

Comment 8

3 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #6)
> If we can isolate the public API for histograms, we can move the locks up to
> that API level and have basically no change performance-wise.

Yes.  That would be a good win.  It would make this tidying-up basically
free, from a performance point of view.
Assignee

Comment 9

3 years ago
Posted patch bug1258183-Telemetry-4.cset (obsolete) — Splinter Review
A revised patch, which is functionally the same as the previous
version, but removes the renaming of the public interface functions
from Foo to FooX, so that it is at least rebaseable.  Hence it is much
smaller than the previous patch and consists almost entirely of
changes to Telemetry.cpp.

It also contains the beginnings of playing around with namespaces to
isolate the internal vs external interfaces.  All internal functions
are now in a TelemetryPriv namespace, which is abbreviated to TP
wherever possible.

This is not intended to be landable or even the basis for a final
patch, but it is at least rebaseable -- by touching two files instead
of hundreds -- and makes it possible for me to chase other races in
Gecko whilst we resolve this bug.
Assignee: nobody → jseward
Attachment #8734618 - Attachment is obsolete: true
Depends on: 1261052
Assignee

Comment 10

3 years ago
Posted patch bug1258183-5-tidying.cset (obsolete) — Splinter Review
Rebased on top of bug 1261052, and split into two parts.  This first
part (bug1258183-5-tidying.cset) merely reorders code in Telemetry.cpp
so as to make it much clearer where/what the external interface really
is.  No functional change.
Attachment #8736667 - Attachment is obsolete: true
Assignee

Comment 11

3 years ago
Posted patch bug1258183-5-derace.cset (obsolete) — Splinter Review
The race fix proper, now much smaller and simpler due to the
cleanups from the 5-tidying.cset patch and from bug 1261052.
Assignee

Comment 12

3 years ago
Attachment #8747052 - Attachment is obsolete: true
Assignee

Comment 13

3 years ago
Posted patch bug1258183-7-derace.cset (obsolete) — Splinter Review
Attachment #8747054 - Attachment is obsolete: true
Assignee

Comment 14

3 years ago
(In reply to Julian Seward [:jseward] from comment #12)
> Created attachment 8751279 [details] [diff] [review]
> bug1258183-7-tidying.cset

This merely re-orders some functions in the file, per comment 10.
No functional change.
Assignee

Comment 15

3 years ago
(In reply to Julian Seward [:jseward] from comment #13)
> Created attachment 8751280 [details] [diff] [review]
> bug1258183-7-derace.cset

The race fix itself, per comment 11.
Assignee

Updated

3 years ago
Attachment #8751279 - Flags: review?(gfritzsche)
Assignee

Updated

3 years ago
Attachment #8751280 - Flags: review?(gfritzsche)
Assignee

Updated

3 years ago
Blocks: 1272066
Comment on attachment 8751279 [details] [diff] [review]
bug1258183-7-tidying.cset

Review of attachment 8751279 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +2399,5 @@
> +////////////////////////////////////////////////////////////////////////
> +////////////////////////////////////////////////////////////////////////
> +//
> +// EXTERNALLY VISIBLE FUNCTIONS in mozilla::
> +// These are NOT listed in Telemetry.h

Let's just move the RecordShutdownEndTimeStamp() & RecordShutdownStartTimeStamp() declarations over to Telemetry.h and into mozilla::Telemetry.
That builds fine here.

@@ +2475,5 @@
> +////////////////////////////////////////////////////////////////////////
> +////////////////////////////////////////////////////////////////////////
> +//
> +// EXTERNALLY VISIBLE FUNCTIONS in mozilla::Telemetry::
> +// These are NOT listed in Telemetry.h

So, if i got this right, the remaining ones that are public and not listed in Telemetry.h are:
ProcessedStack.h: ProcessedStack
ThreadHangStats.h: TimeHistogram, HangStack, HangHistogram, ThreadHangStats

I should fix that in a follow-up bug and maybe just properly move them out into separate modules.
Attachment #8751279 - Flags: review?(gfritzsche) → review+
Comment on attachment 8751280 [details] [diff] [review]
bug1258183-7-derace.cset

Review of attachment 8751280 [details] [diff] [review]:
-----------------------------------------------------------------

I'm cancelling this for now as we'd like to try removing the histogram.h/.cc locking here as well.
That would make it much easier to talk about performance impact (as we only have one change to look at instead of two).
Attachment #8751280 - Flags: review?(gfritzsche)
Assignee

Comment 18

3 years ago
Posted patch bug1258183-8-derace.cset (obsolete) — Splinter Review
Fix, per comment 11, but with extensions.  In total:

* adds mutex acquire/release around all functions listed in Telemetry.h
  (same as previous version of this patch)

* adds mutex acquire/release around
  TelemetryImpl::{Get,Set}CanRecord{Base,Extended}.  These are called
  from "outside" Telemetry.cpp on multiple threads and need locking to
  avoid detectable races.

* removes all locking from histogram.{cc,h}; in effect it undoes bug 1141565

* creates functions Telemetry::{Create,Destroy}StatisticsRecorder, which
  pass those requests through to TelemetryHistogram.  These create and
  destroy the singleton object without having to expose its type and so
  make it possible to remove uses of histogram.h from all places except
  TelemetryHistogram.cpp.

  Unfortunately this loses the statically-checkable-single-pointer property
  enforced by use of UniquePtr<StatisticsRecorder>.  I couldn't figure out
  how to export the type StatisticsRecorder abstractly and still make
  UniquePtr accept it, so I gave up and did it dynamically instead.

Per initial TSan testing, this seems to be just as effective at race
removal (for Telemetry) as the previous version that had locking also
in histogram.cc.
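The "mutex acquire/release around all interface functions" pattern described above can be sketched as follows (an illustrative model only — the real patch uses Gecko's own mutex types, and the function and variable names here are assumptions):

```cpp
#include <mutex>

// One module-wide mutex guards all state reachable from the public
// interface, so every entry point takes it on the way in.
static std::mutex gTelemetryHistogramMutex;
static bool gCanRecordBase = false;

namespace TelemetryHistogram {

void SetCanRecordBase(bool aCanRecord) {
  std::lock_guard<std::mutex> lock(gTelemetryHistogramMutex);
  gCanRecordBase = aCanRecord;
}

bool CanRecordBase() {
  std::lock_guard<std::mutex> lock(gTelemetryHistogramMutex);
  return gCanRecordBase;
}

}  // namespace TelemetryHistogram
```

The catch, as the rest of this bug shows, is that no locked entry point may call (directly or indirectly) back into another locked entry point, or the non-recursive mutex deadlocks.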
Attachment #8751280 - Attachment is obsolete: true
Assignee

Updated

3 years ago
Attachment #8753404 - Flags: feedback?(gfritzsche)
Comment on attachment 8753404 [details] [diff] [review]
bug1258183-8-derace.cset

Review of attachment 8753404 [details] [diff] [review]:
-----------------------------------------------------------------

As discussed on IRC, we want to do locking for Histogram operations in TelemetryHistogram.cpp.
One global lock in Telemetry.cpp for all Telemetry operations would e.g. block different threads accumulating into Histograms while expensive operations are ongoing (thread hangs, i/o monitoring, ...).
Longer-term i want to move the implementation of those other independent tasks out of Telemetry.cpp too, we can add independent locking for those then as needed.

Otherwise this looks good to me so far.

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1486,5 @@
> +}
> +
> +void DestroyStatisticsRecorder()
> +{
> +  if (gStatisticsRecorder) {

Can we MOZ_ASSERT(gStatisticsRecorder)?

::: toolkit/xre/nsAppRunner.cpp
@@ +4566,5 @@
>    // Note: purposely leaked
> +  //base::StatisticsRecorder* statistics_recorder = base::newStatisticsRecorder();
> +  //MOZ_LSAN_INTENTIONALLY_LEAK_OBJECT(statistics_recorder);
> +  //Unused << statistics_recorder;
> +  Telemetry::CreateStatisticsRecorder();

I think this will trigger leak detection now.
Should we mark this as intentionally leaking in TelemetryHistogram.cpp?
Attachment #8753404 - Flags: feedback?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #19)
> As discussed on IRC, we want to do locking for Histogram operations in
> TelemetryHistogram.cpp.
> One global lock in Telemetry.cpp for all Telemetry operations would e.g.
> block different threads accumulating into Histograms while expensive
> operations are ongoing (thread hangs, i/o monitoring, ...).
> Longer-term i want to move the implementation of those other independent
> tasks out of Telemetry.cpp too, we can add independent locking for those
> then as needed.

Chris, does that sound sane to you?
Flags: needinfo?(chutten)
Sounds sane. If all else fails, code is mutable :)

I definitely prefer locking histograms separately from other operations. Histograms are the known race, and we definitely want to lock the least amount of code as is possible.
Flags: needinfo?(chutten)
Assignee

Comment 22

3 years ago
Posted patch bug1258183-9-derace.cset (obsolete) — Splinter Review
Per comment 19 and comment 20, a new version, with the locking moved
into the TelemetryHistogram public interface instead.  You can see
from the patch that Telemetry.cpp is (almost) entirely unchanged.

Removing the resulting deadlocks in TelemetryHistogram.cpp was quite
complex and the result is a new comment at the top of
TelemetryHistogram.cpp, which is visible in the patch.  During
development I renamed all external functions with an x_ prefix and all
internal ones with an i_ prefix, to make it easier to verify that
there are no x_ -> .. -> x_ call chains, which would lead to deadlock.
I found quite a few deadlocking paths like this, and I found three
deadlocks actually during test runs.  I removed the x_ prefixes
afterwards but retained the i_ prefixes on the internal functions
for clarity.  I imagine you will want them removed.
Attachment #8753404 - Attachment is obsolete: true
Assignee

Comment 23

3 years ago
Whilst testing this, I saw a potentially serious problem.  It exists
not only in this version of the patch but in all previous patches
(attempts to de-race Telemetry + Histogram) and may have existed
before, I am not sure.  The problem is inconsistent lock ordering, as
observed by Helgrind (I am sure) and TSan (I am pretty sure).  Like this:

This can happen

   call to TelemetryHistogram::someFn
     locks the interface mutex
        calls into JS engine
           JS engine does garbage collection
              JS engine takes some garbage collection lock

but this can also happen

   JS engine does garbage collection
      JS engine takes some garbage collection lock
         JS engine calls Telemetry to report GC stats
            Telemetry calls TelemetryHistogram
               TelemetryHistogram locks the interface mutex

So we have two mutexes (a GC mutex and the TelemetryHistogram mutex)
that get taken in an inconsistent order.  This is a possible deadlock
(imagine what happens if two threads run the two scenarios above
in parallel).
Assignee

Comment 24

3 years ago
In fact it may be even worse than that.  It may even be possible to
deadlock with just one thread, like this:

   call to TelemetryHistogram::someFn
     locks the interface mutex
        calls into JS engine
           JS engine does garbage collection
              JS engine takes some garbage collection lock
                 JS engine calls Telemetry to report GC stats
                     Telemetry calls TelemetryHistogram
                         TelemetryHistogram tries to lock the interface mutex
                             deadlock

The underlying problem is the circular dependency between Telemetry and
the JS engine.

Any suggestions?
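The single-thread scenario above can be modeled in a few lines (purely illustrative; the interface mutex is modeled as a flag so the failed re-acquisition can be observed instead of actually hanging, and all names are made up):

```cpp
#include <atomic>

// Models the non-recursive interface mutex as a held/not-held flag.
static std::atomic<bool> gInterfaceMutexHeld{false};

bool TryAcquire() {
  bool expected = false;
  return gInterfaceMutexHeld.compare_exchange_strong(expected, true);
}

void Release() { gInterfaceMutexHeld.store(false); }

// TelemetryHistogram::someFn takes the lock, calls into JS, GC runs,
// and the GC stats callback re-enters Telemetry on the same thread.
// With a real (non-recursive) blocking mutex the inner acquisition
// would block forever; here it simply fails.
bool ReentrantCallDeadlocks() {
  TryAcquire();               // outer: TelemetryHistogram::someFn
  bool inner = TryAcquire();  // inner: GC stats callback re-enters
  Release();
  return !inner;              // true: the inner acquisition was blocked
}
```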
I did not realize that telemetry used JS. The GC should probably be doing its reporting out of line via the event loop.
Assignee

Comment 26

3 years ago
(In reply to Terrence Cole [:terrence] from comment #25)
> The GC should probably be doing its reporting out of line via the event loop.

Could you give some hint(s) on how to accomplish this, please?  I am
reluctant to land this with a potential deadlock in it.  We have enough
of those already :-)
Flags: needinfo?(terrence)
Assignee

Comment 27

3 years ago
(In reply to Terrence Cole [:terrence] from comment #25)
This is Helgrind's complaint about the lock ordering.  TSan complains
the same.
The JS engine access here should be all from Telemetry serializing data to JS objects.
We should instead be able to move the JS engine calls out of the locked code for all serialization methods:
* lock
* collect the data thread-locally from shared telemetry storage, no JS engine access
* unlock
* serialize from the thread-local data via JS engine calls
* ... done
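The steps above can be sketched like this (a hypothetical illustration: string-building stands in for the JS engine calls, and the storage layout is assumed):

```cpp
#include <mutex>
#include <string>
#include <vector>

static std::mutex gMutex;
static std::vector<int> gSharedSamples;  // shared telemetry storage

std::string SerializeSamples() {
  std::vector<int> local;
  {
    std::lock_guard<std::mutex> lock(gMutex);
    local = gSharedSamples;  // collect thread-locally; no "JS" work here
  }                          // lock released before any serialization

  // Serialize from the thread-local copy; in the real code this is
  // where the JS engine calls would happen, now outside the lock.
  std::string out = "[";
  for (size_t i = 0; i < local.size(); i++) {
    if (i) out += ",";
    out += std::to_string(local[i]);
  }
  return out + "]";
}
```

Because the lock is dropped before serialization, a GC callback re-entering Telemetry during the "JS" phase can take the mutex normally instead of deadlocking.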
(In reply to Georg Fritzsche [:gfritzsche] from comment #28)
> The JS engine access here should be all from Telemetry serializing data to
> JS objects.
> We should instead be able to move the JS engine calls out of the locked code
> for all serialization methods:
> * lock
> * collect the data thread-locally from shared telemetry storage, no JS
> engine access
> * unlock
> * serialize from the thread-local data via JS engine calls
> * ... done

That sounds like a much simpler solution, actually, and would prevent any number of similar footguns (e.g. the HelperThreads lock) that will come up when running JS.

As to moving the GC's telemetry reporting to the event loop, I was thinking it wouldn't be that much work because we already have DebuggerOnGCRunnable. However, looking again, it seems like all of the work is done outside SpiderMonkey. So in addition to packaging up all of the data currently passed to addTelemetry in Statistics::endGC, we'd need to also find a way to put that data on the event loop. We'd probably have to make a type compatible with nsIRunnable in SpiderMonkey's jsfriendapi.h and add a callback to marshal it as a void*. Extremely ugly and unsafe.
Flags: needinfo?(terrence)
Assignee

Comment 30

3 years ago
Posted patch bug1258183-10-derace.cset (obsolete) — Splinter Review
(In reply to Georg Fritzsche [:gfritzsche] from comment #28)
Ok, here's a variant of the patch which does exactly that.  From some
initial testing this appears to be both race-free and deadlock-free
(for Telemetry stuff, obviously), but I need to test this further.

If you want to have a look at the patch, reading the big comment at the
top of TelemetryHistogram.cpp would be a good starting point.
Attachment #8755380 - Attachment is obsolete: true
Comment on attachment 8756315 [details] [diff] [review]
bug1258183-10-derace.cset

Review of attachment 8756315 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +37,5 @@
> +////////////////////////////////////////////////////////////////////////
> +//
> +// Naming: there are 3 kinds of functions in this file:
> +//
> +// * Functions named i_*: ("internal, protected"): these can only be

What do you think about using namespaces instead (protected::*, unprotected::*)?
Or, if using prefixes, protected_*/unprotected_* to be less cryptic?

::: toolkit/xre/nsAppRunner.cpp
@@ +4566,5 @@
>    // thing used by Telemetry (and effectively a global; it's all static).
>    // Note: purposely leaked
> +  //base::StatisticsRecorder* statistics_recorder = base::newStatisticsRecorder();
> +  //MOZ_LSAN_INTENTIONALLY_LEAK_OBJECT(statistics_recorder);
> +  //Unused << statistics_recorder;

Left-overs.
Comment on attachment 8756315 [details] [diff] [review]
bug1258183-10-derace.cset

Chris, can you take a higher-level look here on whether this looks ok to you too?
The comment at the top of TelemetryHistogram.cpp gives good context.

Can you also recommend performance testing we can do here to see whether this has any impact on more common browsing sessions etc.?
Attachment #8756315 - Flags: feedback?(chutten)
Comment on attachment 8756315 [details] [diff] [review]
bug1258183-10-derace.cset

Review of attachment 8756315 [details] [diff] [review]:
-----------------------------------------------------------------

So calls to i_JSHistogram functions are now unsafe. Shouldn't they be marked as iu_ instead of i_, then? They can be called directly without going through a TelemetryHistogram:: fn, which breaks the contract at the top of the file, no?

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +73,5 @@
> +//
> +// * No internal function (i_*, iu_*) may call, nor take the address
> +//   of, any function in TelemetryHistogram::*.
> +//
> +// The internal functions (i_*, iu_*) ignore locking and are thread-unsafe.

Didn't you just say above "So they have to make their own provision for thread-safety."? That isn't quite the same as saying that they just ignore thread safety entirely.

@@ +627,5 @@
>  // PRIVATE: Histogram corruption helpers
>  
>  namespace {
>  
> +// Is defined further down this file.

Is it common practice to annotate forward declarations? I would think that, being a language "feature", it would be fine to just state without comment.

@@ +1593,5 @@
> +      return;
> +    }
> +  }
> +
> +  MOZ_ASSERT(false, "Telemetry::SetHistogramRecordingEnabled(...) id not found");

MOZ_ASSERT_UNREACHABLE would be more clear, no?
Attachment #8756315 - Flags: feedback?(chutten)
Assignee

Comment 34

3 years ago
(In reply to Chris H-C :chutten from comment #33)
> Comment on attachment 8756315 [details] [diff] [review]
> bug1258183-10-derace.cset

Thank you for the review.

> So calls to i_JSHistogram functions are now unsafe. Shouldn't they be marked
> as iu_ instead of i_, then? They can be called directly without going
> through a TelemetryHistogram:: fn, which breaks the contract at the top of
> the file, no?

Yes, you are correct.  Unfortunately this is all a bit of a kludge.  What we
have is that iu_WrapAndReturnHistogram and iu_WrapAndReturnKeyedHistogram run
without protection of the mutex.  But they may access global state which is,
for other functions, protected by the mutex.  For example:

TelemetryHistogram::GetHistogramById
  (acquire mutex)
  (release mutex)
  -> iu_WrapAndReturnHistogram
     -> i_JSHistogram_Add
        -> i_CanRecordBase
           -> (access gCanRecordBase)
vs

  TelemetryHistogram::ClearHistogram
  (acquire mutex)
  -> i_CanRecordBase
      -> (access gCanRecordBase)
  (release mutex)

I can't think of an easy way to get out of this.  Fundamentally, since
Telemetry can be called from anywhere (any subsystem) within Gecko, it
can't be dependent on any other subsystems, otherwise we wind up with
these circular dependencies.  I don't think there is any other way to
do this properly.

But for sure I don't have time to completely re-engineer the
implementation right now.  So I would propose to land it as-is (modulo
review comments/fixings of course, dropping the iu_ vs i_ distinction,
and fixing up the comments) and accept that it is, at least, less
racey than it was before.  I can't actually detect any races in this
patch -- the comments above are purely from code inspection.  At least
it doesn't seem to be more deadlock-ful than it was before.


> @@ +1593,5 @@
> > +      return;
> > +    }
> > +  }
> > +
> > +  MOZ_ASSERT(false, "Telemetry::SetHistogramRecordingEnabled(...) id not found");
> 
> MOZ_ASSERT_UNREACHABLE would be more clear, no?

I can do that if you like, but I am merely copying code around here; this
is how that function was before I began this refactoring/de-racing.
Assignee

Comment 35

3 years ago
Addresses review comments in comment #33.  See also comment #34.

> What do you think about using namespaces instead (procted::*,
> unprotected::*)?

Per discussion yesterday, I couldn't get namespaces to give
safe-enough isolation.  So I've stayed with prefixes but used
internal_ instead of i_.  Per comment 34, I removed the i_ vs iu_
distinction.
Attachment #8756315 - Attachment is obsolete: true
Assignee

Updated

3 years ago
Attachment #8757305 - Flags: review?(gfritzsche)
Comment on attachment 8757305 [details] [diff] [review]
bug1258183-12-derace.cset

I'm out sick for some days.
Chris, can you take a closer look?
Attachment #8757305 - Flags: review?(chutten)
Comment on attachment 8757305 [details] [diff] [review]
bug1258183-12-derace.cset

Review of attachment 8757305 [details] [diff] [review]:
-----------------------------------------------------------------

One small problem in a comment, and then you're good to go.

I think applause is necessary here for not letting "perfect" get in the way of "better than it was". Good show.

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +76,5 @@
> +// |gTelemetryHistogramMutex| because they make calls to the JS
> +// engine, but that can in turn call back to Telemetry and hence back
> +// to a TelemetryHistogram:: function, in order to report GC and other
> +// statistics.  This would lead to deadlock due to attempted double
> +// acquisition of |gTelemetryHistogramMutex|, if the iu_* functions

Missed a s/iu_/internal_/
Attachment #8757305 - Flags: review?(chutten) → review+
Comment on attachment 8757305 [details] [diff] [review]
bug1258183-12-derace.cset

Review of attachment 8757305 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1070,5 @@
> +// deadlock because the JS_ calls that they make may call back into the
> +// TelemetryHistogram interface, hence trying to re-acquire the mutex.
> +//
> +// This means that these functions potentially race against threads, but
> +// that seems preferable to risking deadlock.

Can't we avoid deadlocks properly by re-factoring the JS-exposed functions?

In general something like:
* initial JS calls (arg conversion, ...)
* lock
* internal_* calls
* unlock
* final JS calls (return value construction, serialization, ...)

This is presumably not a big concern right now because these functions don't show up in TSan now etc.?
Should we move this to a follow-up bug then?
Chris, did you have performance concerns here with the restructured locking?
If yes, can you think of suitable test scenarios and measurements?
Flags: needinfo?(chutten)
Having read the code, I have fewer perf concerns than I thought I would. I think a talos run with Telemetry enabled is definitely in the cards to check for system-wide changes under load.

Do we happen to have a tool to measure waits on a lock? I wouldn't mind seeing a report of how much time someone is waiting on the mutex during standard browsing tasks. I suspect it's basically 0, given our cavalier disregard of thread safety in the past, but it would be a good number to have on hand :)

I'm in favour of a follow-up bug for the JS calls.
Flags: needinfo?(chutten)
Assignee

Comment 41

3 years ago
(In reply to Chris H-C :chutten from comment #40)
> Having read the code, I have fewer perf concerns than I thought I would.

I ran a -O build on valgrind's callgrind tool, because that can count
global memory operations (the expensive part of locks/unlocks) as well as
instructions.  I let Fx start up and idle, then quit.  This is pretty
difficult because there's never really a point at which Fx goes completely
idle, but anyway.  Results are:

Without patch:
insns 15,944,543,459   global mem ops 11,028,744   (one per 1445.72 insns)

With patch
insns 16,017,032,249   global mem ops 11,044,495   (one per 1450.22 insns)

so the results are basically in the noise.  In terms of insns per global op,
the patched version is marginally better, but that could just be noise.

Having been round this loop before, for doing the locking stuff on histogram.cc
(which this patch removes :) I can say that the locking around malloc/free is
by far the largest source of global mem ops in the system, and the telemetry
induced locking was almost in the noise.  But anyway ..

> I think a talos run with Telemetry enabled is definitely in the cards to check
> for system-wide changes under load.

Do you prefer I do this before landing?  I was just about to push it.

> Do we happen to have a tool to measure waits on a lock?

I did try one of those on Linux a few years back, but I forget the details.
I can dig it up if you like.
I'm okay with you pushing before the talos run completes if you're okay with backing out the change in case something shows up :)
Assignee

Comment 43

3 years ago
(In reply to Chris H-C :chutten from comment #42)
Ok, I'll go with the Talos gig :)

I followed the same recipe as here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1141565#c20

Results are:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3b875dab3563&newProject=try&newRevision=034be7a7473b&framework=1&showOnlyImportant=0

(without patch = 3b875dab3563, with patch = 034be7a7473b)
Assignee

Comment 44

3 years ago
(In reply to Julian Seward [:jseward] from comment #43)
> (In reply to Chris H-C :chutten from comment #42)
> Ok, I'll go with the Talos gig :)

Chris, can you interpret those?  I'm not sure what is significant and what isn't.
Flags: needinfo?(chutten)
Assignee

Comment 45

3 years ago
(In reply to Julian Seward [:jseward] from comment #44)
> Chris, can you interpret those?  I'm not sure what is significant and what
> isn't.

njn reckons none of the changes shown there are statistically significant.

I also tried the scroll-speed test described at
https://bugzilla.mozilla.org/show_bug.cgi?id=1141565#c24 and #c25, which 
was suggested by Avi as a worst-case telemetry test.  

With layout.frame_rate=0 (maximum possible scrolling rate), using
planet.mozilla.org, 3 runs each with and without the patch produced the
following:

In all cases: Window size: 939 x 806

Without patch:  Avg Interval: 1.37  1.36  1.34 (ms)
                StdDev:       0.87  0.86  0.84 (ms)

With patch:     Avg Interval: 1.36  1.36  1.36 (ms)
                StdDev:       0.85  0.82  0.85 (ms)

So it seems neither obviously better nor worse.
Flags: needinfo?(chutten)
Comment 46

3 years ago
Seems this caused https://treeherder.mozilla.org/logviewer.html#?job_id=29417159&repo=mozilla-inbound and so I had to back this out, sorry!
Flags: needinfo?(jseward)
Comment 47

3 years ago
Comment on attachment 8757305 [details] [diff] [review]
bug1258183-12-derace.cset

It seems like this is settled.
Attachment #8757305 - Flags: review?(gfritzsche)

Comment 48

3 years ago
Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/dd87e81a3240
TSan: data race toolkit/components/telemetry/Telemetry.cpp in CanRecordBase (part 1, tidying). r=gfritzsche.
https://hg.mozilla.org/integration/mozilla-inbound/rev/f7481a586899
TSan: data race toolkit/components/telemetry/Telemetry.cpp in CanRecordBase (part 2, derace).  r=chutten.

Comment 49

3 years ago
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2462c9d05dbd
Backed out changeset f7481a586899 
https://hg.mozilla.org/integration/mozilla-inbound/rev/d6070dd805cf
Backed out changeset dd87e81a3240 for kraken crashes
Assignee

Comment 50

3 years ago
(In reply to Pulsebot from comment #49)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/2462c9d05dbd
> https://hg.mozilla.org/integration/mozilla-inbound/rev/d6070dd805cf
> Backed out changeset dd87e81a3240 for kraken crashes

It deadlocked, exactly as I feared in comment #24:

  call to TelemetryHistogram::someFn
    locks the interface mutex
      calls into JS engine
        JS engine does garbage collection
          JS engine takes some garbage collection lock
            JS engine calls Telemetry to report GC stats
              Telemetry calls TelemetryHistogram
                TelemetryHistogram tries to lock the interface mutex
                  deadlock

specifically

  TelemetryHistogram::CreateHistogramSnapshots [TelemetryHistogram.cpp:2122]
    internal_ReflectHistogramSnapshot [TelemetryHistogram.cpp:729]
      JS_NewArrayObject [jsapi.cpp:3307]
        js::NewDenseFullyAllocatedArray [jsarray.cpp:3429]
          js::ArrayObject::createArray [ArrayObject-inl.h:54]
            js::Allocate<JSObject, (js::AllowGC)1u> [Allocator.cpp:85]
              js::gc::GCRuntime::minorGC [jsgc.cpp:8199:33]
                js::gc::GCRuntime::minorGCImpl [jsgc.cpp:6515]
                  js::Nursery::collect [Nursery.cpp:540]
                    AccumulateTelemetryCallback [XPCJSRuntime.cpp:3254]
                      TelemetryHistogram::Accumulate [Mutex.h:164]
                        libpthread-2.15.so + 0xe89c
                          (deadlocks here and is killed by the hang detector)

I am not sure what to do about this.  Maybe it can all be refactored as Georg
suggests in comment #38, but that drags the process out even more.  On the
other hand I'm not really happy to land this while we're playing whack-a-mole
games with deadlocks.
Flags: needinfo?(jseward)
Assignee

Comment 51

3 years ago
I guess this hang is timing/scheduling dependent.  I can't reproduce it
locally, using these STR:

  cd testing/talos
  python INSTALL.py
  source bin/activate
  cd ../..

  DISPLAY=:1.0 talos -e \
    /home/sewardj/MOZ/MC-RACE/ff-Og-linux64/dist/bin/firefox \
    --develop --cycles 1 --tppagecycles 1 -a dromaeo_css

  deactivate
Comment 52

3 years ago
(In reply to Julian Seward [:jseward] from comment #50)
> It deadlocked, exactly as I feared in comment #24:
> 
>   call to TelemetryHistogram::someFn
>     locks the interface mutex
>       calls into JS engine
...
> 
> specifically
> 
>   TelemetryHistogram::CreateHistogramSnapshots [TelemetryHistogram.cpp:2122]
>     internal_ReflectHistogramSnapshot [TelemetryHistogram.cpp:729]
... 
> I am not sure what to do about this.  Maybe it can all be refactored as Georg
> suggests in comment #38, but that drags the process out even more.  On the
> other hand I'm not really happy to land this while we're playing whack-a-mole
> games with deadlocks.

I thought the conclusion here was to not lock around the JS-accessing functions and move the JS-access refactoring to a follow-up bug?

I see that there might be some changes required for TelemetryHistogram::CreateHistogramSnapshots() though, as it accesses some racy functions.
Comment 53

3 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #52)
...
> I see that there might be some changes required for
> TelemetryHistogram::CreateHistogramSnapshots() though, as it accesses some
> racy functions.

Or we decide that we can temporarily live with those races.
Creating histogram snapshots is an operation that does not happen frequently.
Assignee

Comment 54

3 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #52)

> I thought the conclusion here was to not lock around the JS-accessing
> functions and move the JS-access refactoring to a follow-up bug?

Yes .. my bad.  It seems I missed
TelemetryHistogram::{Create,GetKeyed,GetAddon}HistogramSnapshots.

(In reply to Georg Fritzsche [:gfritzsche] from comment #53)

> Or we decide that we can temporarily live with those races.
> Creating histogram snapshots is an operation that does not happen frequently.

Yes.  I am inclined to go for that solution.  Fixing the above 3 functions properly
is too difficult right now.
Assignee

Updated

3 years ago
Blocks: 1278821
Assignee

Comment 55

3 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #38)
> Can't we avoid deadlocks properly by re-factoring the JS-exposed functions?
> [..]
> Should we move this to a follow-up bug then?

Done, bug 1278821.

Comment 56

3 years ago
Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/fc792a05cc49
TSan: data race toolkit/components/telemetry/Telemetry.cpp in CanRecordBase (part 1, tidying).  r=gfritzsche.
https://hg.mozilla.org/integration/mozilla-inbound/rev/1b4582536f10
TSan: data race toolkit/components/telemetry/Telemetry.cpp in CanRecordBase (part 2, derace).  r=chutten.

Comment 57

3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/fc792a05cc49
https://hg.mozilla.org/mozilla-central/rev/1b4582536f10
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50

Updated

3 years ago
Depends on: 1279614
Depends on: 1338902

Updated

2 years ago
Depends on: 1350765