Closed Bug 1258183 Opened 9 years ago Closed 8 years ago

TSan: data race toolkit/components/telemetry/Telemetry.cpp in CanRecordBase

Categories

(Toolkit :: Telemetry, defect, P3)

defect

Tracking


RESOLVED FIXED
mozilla50
Tracking Status
firefox50 --- fixed

People

(Reporter: jseward, Assigned: jseward)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [measurement:client])

Attachments

(4 files, 8 obsolete files)

Attached file race-CanRecordBase
The attached logfile shows a thread/data race detected by TSan (ThreadSanitizer).

* Specific information about this bug

This is effectively a case of "Publishing objects without synchronization" as described at
https://web.archive.org/web/20150331075219/https://code.google.com/p/thread-sanitizer/wiki/PopularDataRaces

In this case, the race is on a global variable

  TelemetryImpl* TelemetryImpl::sTelemetry

which starts out being NULL. One thread allocates an object and assigns it to |sTelemetry|. A different thread queries the object like this

  return !sTelemetry || sTelemetry->mCanRecordBase;

and there is no synchronisation at all between the two.

* General information about TSan, data races, etc.

Typically, races reported by TSan are not false positives, but it is possible that the race is benign. Even in this case though, we should try to come up with a fix unless this would cause unacceptable performance issues. Also note that seemingly benign races can possibly be harmful (also depending on the compiler and the architecture) [1][2].

If the bug cannot be fixed, then this bug should be used to either make a compile-time annotation for blacklisting or add an entry to the runtime blacklist.

[1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[2] "How to miscompile programs with 'benign' data races": https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
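For illustration, a minimal standalone sketch of the racy publish pattern (simplified; the names follow the report, everything else here is an assumption, not the actual Telemetry.cpp code):

  class TelemetryImpl {
  public:
    bool mCanRecordBase = false;
    static TelemetryImpl* sTelemetry;  // starts out NULL
  };

  TelemetryImpl* TelemetryImpl::sTelemetry = nullptr;

  // Thread A publishes the object with a plain store -- no lock,
  // no atomic, no memory barrier.
  void InitTelemetry() {
    TelemetryImpl::sTelemetry = new TelemetryImpl();
  }

  // Thread B reads and dereferences the pointer concurrently. It may
  // observe a non-null pointer but stale (or partially constructed)
  // object contents, depending on compiler and architecture -- hence
  // the TSan report.
  bool CanRecordBase() {
    return !TelemetryImpl::sTelemetry ||
           TelemetryImpl::sTelemetry->mCanRecordBase;
  }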
Priority: -- → P3
Whiteboard: [measurement:client]
The fact that mozilla::Telemetry isn't thread safe accounts for a whole bunch of different races that TSan reports. At least the following:

  11 toolkit/components/telemetry/Telemetry.cpp:3322:37 in CanRecordBase
  11 toolkit/components/telemetry/Telemetry.cpp:3322:11 in CanRecordBase
   3 /usr/lib/gcc/x86_64-redhat-linux/4.9.2/../../../../include/c++/4.9.2/bits/stl_vector.h:795:32 in operator[]
   3 /usr/lib/gcc/x86_64-redhat-linux/4.9.2/../../../../include/c++/4.9.2/bits/basic_string.h:293:29 in _M_data
   3 /usr/lib/gcc/x86_64-redhat-linux/4.9.2/../../../../include/c++/4.9.2/bits/atomic_base.h:500:9 in load
   3 toolkit/components/telemetry/Telemetry.cpp:1058:18 in GetHistogramByEnumId
   3 (on vptr, ctor/dtor vs virtual call) ipc/chromium/src/base/histogram.cc:128:18 in base::Histogram::Add(int)
   3 ipc/chromium/src/base/histogram.cc:392:10 in base::Histogram::bucket_count() const
   3 ipc/chromium/src/base/histogram.cc:388:10 in base::Histogram::ranges(unsigned long) const
Attached patch bug1258183-Telemetry-3.cset (obsolete) — Splinter Review
Proof-of-concept patch. Do not land.

Following an irc conversation with gfritzsche yesterday, there are three sets of functions which make up the external interface to Telemetry.cpp and where we may want to add locking, so as to de-race the module:

[1] The functions listed in toolkit/components/telemetry/Telemetry.h

[2] The functions listed in NS_DECL_NSITELEMETRY and NS_DECL_NSIMEMORYREPORTER in class TelemetryImpl

[3] Misc functions JSHistogram_* and JSKeyedHistogram_*

Locking is controlled for these, in this patch, by the macros EXTERNAL_METHOD_H, EXTERNAL_METHOD_NS and EXTERNAL_METHOD_JS respectively. Currently locking is enabled only for [1] and [3], because enabling locking for [2] leads to call chains of the form Telemetry -> JS-world -> Telemetry (to report some GC statistic) and hence it deadlocks.

The patch is huge. Almost all of it is unnecessary: it renames the external call sites for [1] by adding "X" at the end of their names, so I could figure out what the external interface really was. Per irc chat yesterday this can probably be done better using namespaces, so all of those ~1000 hunks should eventually disappear.

For the Telemetry.cpp changes themselves, the majority is a reorganisation of the file, adding comment lines at the start of large sections to make it easier to navigate. The actual race-removal code proper is pretty small.
Also, as mentioned on IRC, I think we have two main categories of code in Telemetry.cpp:
(1) Histogram accumulation, control and serialization
(2) "other" Telemetry functionality that was historically accumulated in there

Pending feasibility, I would like us to treat (1) separately from (2) for locking etc.:
* I would assume that most (nearly all?) of the races are in the histogram code
* there are potentially expensive operations in (2) that are independent of the code in (1)
* the code for (2) is clearly unrelated to (1)

If it helps with reasoning etc., we could look into moving (1) out into a contained module with a clear, separate public interface. I'm not entirely sure about the fallout, but I could try to hack this out and see if it works - if that's helpful.
I am worried about overhead. Some calls (including even some accumulations) are done on extremely hot paths. Any changes will have to be measured for performance impact.

I'm not so worried about the original reported race in CanRecordBase. The code, slightly rewritten, is:

  if (!sTelemetry) {
    return false;
  }
  return sTelemetry->mCanRecordBase;

Which looks safe to me.

As much as it pains me to suggest it, maybe we should have both ThreadSafeTelemetry and ThreadUnsafeTelemetry APIs; then accumulations and queries on hot paths with external synchronization can be fast, and JS and junk can access it without external protections?
(In reply to Chris H-C :chutten from comment #4)
> Which looks safe to me.

"Which looks safe to me since no one clears sTelemetry" is what I meant to write.
(In reply to Chris H-C :chutten from comment #4)
> I am worried about overhead. Some calls (including even some accumulations)
> are done on extremely hot paths. Any changes will have to be measured for
> performance impact.

We already have locking around histograms in histogram.h/.cc (after rough performance tests in bug 1141565), although it would be great to have more input on performance validation for this.

If we can isolate the public API for histograms, we can move the locks up to that API level and have basically no performance change.

In general, histogram accumulation should probably not happen in extremely hot paths - in case of hot loops etc. we should ideally accumulate locally and submit later/periodically (see the sketch below).
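A hedged sketch of the "accumulate locally, submit periodically" idea. The names here are hypothetical; AccumulateBatch stands in for whatever locked submission API we would end up with, not a real Telemetry function:

  #include <cstdint>
  #include <vector>

  // Hypothetical locked submission API (assumption, not real code).
  void AccumulateBatch(const std::vector<uint32_t>& samples);

  void HotLoop(const std::vector<uint32_t>& work) {
    std::vector<uint32_t> local;
    local.reserve(work.size());
    for (uint32_t w : work) {
      // ... the actual hot per-item work ...
      local.push_back(w);    // plain append: no lock, no atomic op
    }
    AccumulateBatch(local);  // one lock acquisition per batch, not per item
  }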
(In reply to Georg Fritzsche [:gfritzsche] from comment #3)
> Also, as mentioned on IRC, I think we have two main categories of code in
> Telemetry.cpp:
> (1) Histogram accumulation, control and serialization
> (2) "other" Telemetry functionality that was historically accumulated in
>     there
> [..]
> If it helps with reasoning etc., we could look into moving (1) out into a
> contained module with a clear, separate public interface.
> I'm not entirely sure about the fallout, but I could try to hack this out
> and see if it works - if that's helpful.

I think that would definitely be a helpful thing to do. Telemetry.cpp is big and complex and intertwined, and your proposal will help untangle it.
(In reply to Georg Fritzsche [:gfritzsche] from comment #6)
> If we can isolate the public API for histograms, we can move the locks up
> to that API level and have basically no performance change.

Yes. That would be a good win. It would make this tidying-up basically free, from a performance point of view.
Attached patch bug1258183-Telemetry-4.cset (obsolete) — Splinter Review
A revised patch, which is functionally the same as the previous version, but removes the renaming of the public interface functions from Foo to FooX, so that it is at least rebaseable. Hence it is much smaller than the previous patch and consists almost entirely of changes to Telemetry.cpp. It also contains the beginnings of playing around with namespaces to isolate the internal vs external interfaces. All internal functions are now in a TelemetryPriv namespace, which is abbreviated to TP wherever possible. This is not intended to be landable or even the basis for a final patch, but it is at least rebaseable -- by touching two files instead of hundreds -- and makes it possible for me to chase other races in Gecko whilst we resolve this bug.
Assignee: nobody → jseward
Attachment #8734618 - Attachment is obsolete: true
Depends on: 1261052
Attached patch bug1258183-5-tidying.cset (obsolete) — Splinter Review
Rebased on top of bug 1261052, and split into two parts. This first part (bug1258183-5-tidying.cset) merely reorders code in Telemetry.cpp so as to make it much clearer where/what the external interface really is. No functional change.
Attachment #8736667 - Attachment is obsolete: true
Attached patch bug1258183-5-derace.cset (obsolete) — Splinter Review
The race fix proper, now much smaller and simpler due to the cleanups from the 5-tidying.cset patch and from bug 1261052.
Attachment #8747052 - Attachment is obsolete: true
Attached patch bug1258183-7-derace.cset (obsolete) — Splinter Review
Attachment #8747054 - Attachment is obsolete: true
(In reply to Julian Seward [:jseward] from comment #12)
> Created attachment 8751279 [details] [diff] [review]
> bug1258183-7-tidying.cset

This is merely a re-ordering of some functions in the file, per comment 10. No functional change.
(In reply to Julian Seward [:jseward] from comment #13)
> Created attachment 8751280 [details] [diff] [review]
> bug1258183-7-derace.cset

The race fix itself, per comment 11.
Attachment #8751279 - Flags: review?(gfritzsche)
Attachment #8751280 - Flags: review?(gfritzsche)
Blocks: 1272066
Comment on attachment 8751279 [details] [diff] [review]
bug1258183-7-tidying.cset

Review of attachment 8751279 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +2399,5 @@
> +////////////////////////////////////////////////////////////////////////
> +////////////////////////////////////////////////////////////////////////
> +//
> +// EXTERNALLY VISIBLE FUNCTIONS in mozilla::
> +// These are NOT listed in Telemetry.h

Let's just move the RecordShutdownEndTimeStamp() & RecordShutdownStartTimeStamp() declarations over to Telemetry.h and into mozilla::Telemetry. That builds fine here.

@@ +2475,5 @@
> +////////////////////////////////////////////////////////////////////////
> +////////////////////////////////////////////////////////////////////////
> +//
> +// EXTERNALLY VISIBLE FUNCTIONS in mozilla::Telemetry::
> +// These are NOT listed in Telemetry.h

So, if I got this right, the remaining ones that are public and not listed in Telemetry.h are:

ProcessedStack.h: ProcessedStack
ThreadHangStats.h: TimeHistogram, HangStack, HangHistogram, ThreadHangStats

I should fix that in a follow-up bug and maybe just properly move them out into separate modules.
Attachment #8751279 - Flags: review?(gfritzsche) → review+
Comment on attachment 8751280 [details] [diff] [review]
bug1258183-7-derace.cset

Review of attachment 8751280 [details] [diff] [review]:
-----------------------------------------------------------------

I'm cancelling this for now, as we'd like to try removing the histogram.h/.cc locking here as well. That would make it much easier to talk about performance impact (as we only have one change to look at instead of two).
Attachment #8751280 - Flags: review?(gfritzsche)
Attached patch bug1258183-8-derace.cset (obsolete) — Splinter Review
Fix, per comment 11, but with extensions. In total:

* adds mutex acquire/release around all functions listed in Telemetry.h (same as the previous version of this patch)

* adds mutex acquire/release around TelemetryImpl::{Get,Set}CanRecord{Base,Extended}. These are called from "outside" Telemetry.cpp on multiple threads and need locking to avoid detectable races.

* removes all locking from histogram.{cc,h}; in effect it undoes bug 1141565

* creates functions Telemetry::{Create,Destroy}StatisticsRecorder, which pass those requests through to TelemetryHistogram. These create and destroy the singleton object without having to expose its type, and so make it possible to remove uses of histogram.h from all places except TelemetryHistogram.cpp. Unfortunately this loses the statically-checkable single-pointer property enforced by use of UniquePtr<StatisticsRecorder>. I couldn't figure out how to export the type StatisticsRecorder abstractly and still make UniquePtr accept it, so I gave up and did it dynamically instead. (See the sketch below.)

Per initial TSan testing, this seems to be just as effective at race removal (for Telemetry) as the previous version that had locking also in histogram.cc.
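A minimal sketch of that pass-through shape: the concrete type base::StatisticsRecorder stays private to TelemetryHistogram.cpp, and callers only see the two entry points. This is simplified from the description above; the bodies are assumptions, not the landed code:

  // TelemetryHistogram.cpp (sketch only)
  namespace {
  base::StatisticsRecorder* gStatisticsRecorder = nullptr;
  } // namespace

  namespace TelemetryHistogram {

  void CreateStatisticsRecorder()
  {
    MOZ_ASSERT(!gStatisticsRecorder);
    gStatisticsRecorder = new base::StatisticsRecorder();
  }

  void DestroyStatisticsRecorder()
  {
    if (gStatisticsRecorder) {
      delete gStatisticsRecorder;
      gStatisticsRecorder = nullptr;
    }
  }

  } // namespace TelemetryHistogram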
Attachment #8751280 - Attachment is obsolete: true
Attachment #8753404 - Flags: feedback?(gfritzsche)
Comment on attachment 8753404 [details] [diff] [review]
bug1258183-8-derace.cset

Review of attachment 8753404 [details] [diff] [review]:
-----------------------------------------------------------------

As discussed on IRC, we want to do locking for Histogram operations in TelemetryHistogram.cpp. One global lock in Telemetry.cpp for all Telemetry operations would e.g. block different threads accumulating into Histograms while expensive operations are ongoing (thread hangs, i/o monitoring, ...). Longer-term I want to move the implementation of those other independent tasks out of Telemetry.cpp too; we can add independent locking for those then as needed.

Otherwise this looks good to me so far.

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1486,5 @@
> +}
> +
> +void DestroyStatisticsRecorder()
> +{
> +  if (gStatisticsRecorder) {

Can we MOZ_ASSERT(gStatisticsRecorder)?

::: toolkit/xre/nsAppRunner.cpp
@@ +4566,5 @@
>    // Note: purposely leaked
> +  //base::StatisticsRecorder* statistics_recorder = base::newStatisticsRecorder();
> +  //MOZ_LSAN_INTENTIONALLY_LEAK_OBJECT(statistics_recorder);
> +  //Unused << statistics_recorder;
> +  Telemetry::CreateStatisticsRecorder();

I think this will trigger leak detection now. Should we mark this as intentionally leaking in TelemetryHistogram.cpp?
Attachment #8753404 - Flags: feedback?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #19)
> As discussed on IRC, we want to do locking for Histogram operations in
> TelemetryHistogram.cpp.
> One global lock in Telemetry.cpp for all Telemetry operations would e.g.
> block different threads accumulating into Histograms while expensive
> operations are ongoing (thread hangs, i/o monitoring, ...).
> Longer-term I want to move the implementation of those other independent
> tasks out of Telemetry.cpp too; we can add independent locking for those
> then as needed.

Chris, does that sound sane to you?
Flags: needinfo?(chutten)
Sounds sane. If all else fails, code is mutable :) I definitely prefer locking histograms separately from other operations. Histograms are the known race, and we definitely want to lock the least amount of code possible.
Flags: needinfo?(chutten)
Attached patch bug1258183-9-derace.cset (obsolete) — Splinter Review
Per comment 19 and comment 20, a new version, with the locking moved into the TelemetryHistogram public interface instead. You can see from the patch that Telemetry.cpp is (almost) entirely unchanged.

Removing the resulting deadlocks in TelemetryHistogram.cpp was quite complex, and the result is a new comment at the top of TelemetryHistogram.cpp, which is visible in the patch.

During development I renamed all external functions with an x_ prefix and all internal ones with an i_ prefix, to make it easier to verify that there are no x_ -> .. -> x_ call chains, which would lead to deadlock. I found quite a few deadlocking paths like this, and I found three actual deadlocks during test runs. I removed the x_ prefixes afterwards but retained the i_ prefixes on the internal functions for clarity. I imagine you will want them removed.
Attachment #8753404 - Attachment is obsolete: true
Whilst testing this, I saw a potentially serious problem. It exists not only in this version of the patch but in all previous patches (attempts to de-race Telemetry + Histogram), and may have existed before; I am not sure. The problem is inconsistent lock ordering, as observed by Helgrind (I am sure) and TSan (I am pretty sure). Like this:

This can happen:

  call to TelemetryHistogram::someFn
    locks the interface mutex
    calls into JS engine
      JS engine does garbage collection
        JS engine takes some garbage collection lock

but this can also happen:

  JS engine does garbage collection
    JS engine takes some garbage collection lock
      JS engine calls Telemetry to report GC stats
        Telemetry calls TelemetryHistogram
          TelemetryHistogram locks the interface mutex

So we have two mutexes (a GC mutex and the TelemetryHistogram mutex) that get taken in an inconsistent order. This is a possible deadlock (imagine what happens if two threads run the two scenarios above in parallel).
In fact it may be even worse than that. It may be possible to deadlock with even just one thread, like this:

  call to TelemetryHistogram::someFn
    locks the interface mutex
    calls into JS engine
      JS engine does garbage collection
        JS engine takes some garbage collection lock
          JS engine calls Telemetry to report GC stats
            Telemetry calls TelemetryHistogram
              TelemetryHistogram tries to lock the interface mutex
              deadlock

The underlying problem is the circular dependency between Telemetry and the JS engine. Any suggestions?
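For concreteness, a standalone sketch of that single-thread self-deadlock, assuming a non-recursive mutex (std::mutex here, standing in for the interface mutex; all names are illustrative):

  #include <mutex>

  std::mutex gInterfaceMutex;   // stands in for the interface mutex

  void Accumulate();            // forward declaration; re-entered below

  // Models "JS engine does GC and reports stats back to Telemetry".
  void CallIntoJSEngine() {
    Accumulate();               // GC stats callback re-enters Telemetry
  }

  void SomeFn() {
    std::lock_guard<std::mutex> lock(gInterfaceMutex);
    CallIntoJSEngine();         // still holding gInterfaceMutex here
  }

  void Accumulate() {
    // Same thread, second acquisition of a non-recursive mutex:
    // this blocks forever.
    std::lock_guard<std::mutex> lock(gInterfaceMutex);
    // ... record the sample ...
  }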
I did not realize that telemetry used JS. The GC should probably be doing its reporting out of line via the event loop.
(In reply to Terrence Cole [:terrence] from comment #25)
> The GC should probably be doing its reporting out of line via the event
> loop.

Could you give some hint(s) on how to accomplish this, please? I am reluctant to land this with a potential deadlock in it. We have enough of those already :-)
Flags: needinfo?(terrence)
(In reply to Terrence Cole [:terrence] from comment #25)

This is Helgrind's complaint about the lock ordering. TSan complains the same.
The JS engine access here should be all from Telemetry serializing data to JS objects. We should instead be able to move the JS engine calls out of the locked code for all serialization methods:

* lock
* collect the data thread-locally from shared telemetry storage, no JS engine access
* unlock
* serialize from the thread-local data via JS engine calls
* ... done
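A hedged sketch of that shape; only the lock/snapshot/unlock/serialize ordering is the point. internal_SnapshotLocked and ReflectSnapshotToJS are hypothetical stand-ins, not real Telemetry functions:

  #include <cstdint>
  #include <mutex>
  #include <vector>

  struct HistogramSnapshot {
    std::vector<uint32_t> counts;   // plain data, no JS objects
  };

  std::mutex gTelemetryHistogramMutex;
  HistogramSnapshot internal_SnapshotLocked();                  // assumed
  bool ReflectSnapshotToJS(const HistogramSnapshot& snapshot);  // assumed

  bool CreateHistogramSnapshots() {
    HistogramSnapshot snapshot;
    {
      std::lock_guard<std::mutex> lock(gTelemetryHistogramMutex);
      snapshot = internal_SnapshotLocked();  // no JS engine calls in here
    }
    // Lock released: the JS calls below may re-enter Telemetry (e.g. GC
    // stats reporting) without trying to re-acquire the mutex we held.
    return ReflectSnapshotToJS(snapshot);
  }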
(In reply to Georg Fritzsche [:gfritzsche] from comment #28)
> The JS engine access here should be all from Telemetry serializing data to
> JS objects.
> We should instead be able to move the JS engine calls out of the locked
> code for all serialization methods:
> * lock
> * collect the data thread-locally from shared telemetry storage, no JS
>   engine access
> * unlock
> * serialize from the thread-local data via JS engine calls
> * ... done

That sounds like a much simpler solution, actually, and would prevent any number of similar footguns (e.g. the HelperThreads lock) that will come up when running JS.

As to moving the GC's telemetry reporting to the event loop, I was thinking it wouldn't be that much work because we already have DebuggerOnGCRunnable. However, looking again, it seems like all of the work is done outside SpiderMonkey. So in addition to packaging up all of the data currently passed to addTelemetry in Statistics::endGC, we'd need to also find a way to put that data on the event loop. We'd probably have to make a type compatible with nsIRunnable in SpiderMonkey's jsfriendapi.h and add a callback to marshal it as a void*. Extremely ugly and unsafe.
Flags: needinfo?(terrence)
Attached patch bug1258183-10-derace.cset (obsolete) — Splinter Review
(In reply to Georg Fritzsche [:gfritzsche] from comment #28)

Ok, here's a variant of the patch which does exactly that. From some initial testing this appears to be both race-free and deadlock-free (for Telemetry stuff, obviously), but I need to test this further. If you want to have a look at the patch, reading the big comment at the top of TelemetryHistogram.cpp would be a good starting point.
Attachment #8755380 - Attachment is obsolete: true
Comment on attachment 8756315 [details] [diff] [review]
bug1258183-10-derace.cset

Review of attachment 8756315 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +37,5 @@
> +////////////////////////////////////////////////////////////////////////
> +//
> +// Naming: there are 3 kinds of functions in this file:
> +//
> +// * Functions named i_*: ("internal, protected"): these can only be

What do you think about using namespaces instead (protected::*, unprotected::*)? Or, if using prefixes, protected_*/unprotected_* to be less cryptic?

::: toolkit/xre/nsAppRunner.cpp
@@ +4566,5 @@
>    // thing used by Telemetry (and effectively a global; it's all static).
>    // Note: purposely leaked
> +  //base::StatisticsRecorder* statistics_recorder = base::newStatisticsRecorder();
> +  //MOZ_LSAN_INTENTIONALLY_LEAK_OBJECT(statistics_recorder);
> +  //Unused << statistics_recorder;

Left-overs.
Comment on attachment 8756315 [details] [diff] [review]
bug1258183-10-derace.cset

Chris, can you take a higher-level look here on whether this looks ok to you too? The comment at the top of TelemetryHistogram.cpp gives good context.

Can you also recommend performance testing we can do here to see whether this has any impact on more common browsing sessions etc.?
Attachment #8756315 - Flags: feedback?(chutten)
Comment on attachment 8756315 [details] [diff] [review]
bug1258183-10-derace.cset

Review of attachment 8756315 [details] [diff] [review]:
-----------------------------------------------------------------

So calls to i_JSHistogram functions are now unsafe. Shouldn't they be marked as iu_ instead of i_, then? They can be called directly without going through a TelemetryHistogram:: fn, which breaks the contract at the top of the file, no?

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +73,5 @@
> +//
> +// * No internal function (i_*, iu_*) may call, nor take the address
> +//   of, any function in TelemetryHistogram::*.
> +//
> +// The internal functions (i_*, iu_*) ignore locking and are thread-unsafe.

Didn't you just say above "So they have to make their own provision for thread-safety."? That isn't quite the same as saying that they just ignore thread safety entirely.

@@ +627,5 @@
>  // PRIVATE: Histogram corruption helpers
>  
>  namespace {
>  
> +// Is defined further down this file.

Is it common practice to annotate forward declarations? I would think that, being a language "feature", it would be fine to just state without comment.

@@ +1593,5 @@
> +      return;
> +    }
> +  }
> +
> +  MOZ_ASSERT(false, "Telemetry::SetHistogramRecordingEnabled(...) id not found");

MOZ_ASSERT_UNREACHABLE would be more clear, no?
Attachment #8756315 - Flags: feedback?(chutten)
(In reply to Chris H-C :chutten from comment #33)
> Comment on attachment 8756315 [details] [diff] [review]
> bug1258183-10-derace.cset

Thank you for the review.

> So calls to i_JSHistogram functions are now unsafe. Shouldn't they be
> marked as iu_ instead of i_, then? They can be called directly without
> going through a TelemetryHistogram:: fn, which breaks the contract at the
> top of the file, no?

Yes, you are correct. Unfortunately this is all a bit of a kludge. What we have is that iu_WrapAndReturnHistogram and iu_WrapAndReturnKeyedHistogram run without protection of the mutex. But they may access global state which is, for other functions, protected by the mutex. For example:

  TelemetryHistogram::GetHistogramById
    (acquire mutex)
    (release mutex)
    -> iu_WrapAndReturnHistogram
       -> i_JSHistogram_Add
          -> i_CanRecordBase
             -> (access gCanRecordBase)

vs

  TelemetryHistogram::ClearHistogram
    (acquire mutex)
    -> i_CanRecordBase
       -> (access gCanRecordBase)
    (release mutex)

I can't think of an easy way to get out of this. Fundamentally, since Telemetry can be called from anywhere (any subsystem) within Gecko, it can't be dependent on any other subsystems, otherwise we wind up with these circular dependencies. I don't think there is any other way to do this properly. But for sure I don't have time to completely re-engineer the implementation right now.

So I would propose to land it as-is (modulo review comments/fixings of course, dropping the iu_ vs i_ distinction, and fixing up the comments) and accept that it is, at least, less racy than it was before. I can't actually detect any races in this patch -- the comments above are purely from code inspection. At least it doesn't seem to be more deadlock-ful than it was before.

> @@ +1593,5 @@
> > +      return;
> > +    }
> > +  }
> > +
> > +  MOZ_ASSERT(false, "Telemetry::SetHistogramRecordingEnabled(...) id not found");
>
> MOZ_ASSERT_UNREACHABLE would be more clear, no?

I can do that if you like, but I am merely copying code around here; this is how that function was before I began this refactoring/de-racing.
Addresses review comments in comment #33. See also comment #34.

> What do you think about using namespaces instead (protected::*,
> unprotected::*)?

Per discussion yesterday, I couldn't get namespaces to give safe-enough isolation. So I've stayed with prefixes but used internal_ instead of i_. Per comment 34, I removed the i_ vs iu_ distinction.
Attachment #8756315 - Attachment is obsolete: true
Attachment #8757305 - Flags: review?(gfritzsche)
Comment on attachment 8757305 [details] [diff] [review]
bug1258183-12-derace.cset

I'm out sick for some days. Chris, can you take a closer look?
Attachment #8757305 - Flags: review?(chutten)
Comment on attachment 8757305 [details] [diff] [review]
bug1258183-12-derace.cset

Review of attachment 8757305 [details] [diff] [review]:
-----------------------------------------------------------------

One small problem in a comment, and then you're good to go.

I think applause is necessary here for not letting "perfect" get in the way of "better than it was". Good show.

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +76,5 @@
> +// |gTelemetryHistogramMutex| because they make calls to the JS
> +// engine, but that can in turn call back to Telemetry and hence back
> +// to a TelemetryHistogram:: function, in order to report GC and other
> +// statistics. This would lead to deadlock due to attempted double
> +// acquisition of |gTelemetryHistogramMutex|, if the iu_* functions

Missed a s/iu_/internal_/
Attachment #8757305 - Flags: review?(chutten) → review+
Comment on attachment 8757305 [details] [diff] [review]
bug1258183-12-derace.cset

Review of attachment 8757305 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHistogram.cpp
@@ +1070,5 @@
> +// deadlock because the JS_ calls that they make may call back into the
> +// TelemetryHistogram interface, hence trying to re-acquire the mutex.
> +//
> +// This means that these functions potentially race against threads, but
> +// that seems preferable to risking deadlock.

Can't we avoid deadlocks properly by re-factoring the JS-exposed functions? In general something like:
* initial JS calls (arg conversion, ...)
* lock
* internal_* calls
* unlock
* final JS calls (return value construction, serialization, ...)

This is presumably not a big concern right now because these functions don't show up in TSan now etc.? Should we move this to a follow-up bug then?
Chris, did you have performance concerns here with the restructured locking? If yes, can you think of suitable test scenarios and measurements?
Flags: needinfo?(chutten)
Having read the code, I have fewer perf concerns than I thought I would. I think a talos run with Telemetry enabled is definitely in the cards, to check for system-wide changes under load.

Do we happen to have a tool to measure waits on a lock? I wouldn't mind seeing a report of how much time someone is waiting on the mutex during standard browsing tasks. I suspect it's basically 0, given our cavalier disregard of thread safety in the past, but it would be a good number to have on hand :)

I'm in favour of a follow-up bug for the JS calls.
Flags: needinfo?(chutten)
(In reply to Chris H-C :chutten from comment #40)
> Having read the code, I have fewer perf concerns than I thought I would.

I ran a -O build on valgrind's callgrind tool, because that can count global memory operations (the expensive part of locks/unlocks) as well as instructions. I let Fx start up and idle, then quit. This is pretty difficult because there's never really a point at which Fx goes completely idle, but anyway. Results are:

  Without patch:  insns 15,944,543,459   global mem ops 11,028,744  (one per 1445.72 insns)
  With patch:     insns 16,017,032,249   global mem ops 11,044,495  (one per 1450.22 insns)

so the results are basically in the noise. In terms of insns per global op, the patched version is marginally better, but that could just be noise.

Having been round this loop before, for doing the locking stuff on histogram.cc (which this patch removes :) I can say that the locking around malloc/free is by far the largest source of global mem ops in the system, and the telemetry-induced locking was almost in the noise. But anyway ..

> I think a talos run with Telemetry enabled is definitely in the cards to
> check for system-wide changes under load.

Do you prefer I do this before landing? I was just about to push it.

> Do we happen to have a tool to measure waits on a lock?

I did try one of those on Linux a few years back, but I forget the details. I can dig it up if you like.
I'm okay with you pushing before the talos run completes if you're okay with backing out the change in case something shows up :)
(In reply to Chris H-C :chutten from comment #42)

Ok, I'll go with the Talos gig :) I followed the same recipe as here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1141565#c20

Results are:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3b875dab3563&newProject=try&newRevision=034be7a7473b&framework=1&showOnlyImportant=0

(without patch = 3b875dab3563, with patch = 034be7a7473b)
(In reply to Julian Seward [:jseward] from comment #43)
> (In reply to Chris H-C :chutten from comment #42)
> Ok, I'll go with the Talos gig :)

Chris, can you interpret those? I'm not sure what is significant and what isn't.
Flags: needinfo?(chutten)
(In reply to Julian Seward [:jseward] from comment #44)
> Chris, can you interpret those? I'm not sure what is significant and what
> isn't.

njn reckons none of the changes shown there are statistically significant.

I also tried the scroll-speed test described at
https://bugzilla.mozilla.org/show_bug.cgi?id=1141565#c24 and #c25, which was suggested by Avi as a worst-case telemetry test, with layout.frame_rate=0 (maximum possible scrolling rate), using planet.mozilla.org, 3 times each with/without the patch. This produces the following:

  In all cases:   Window size: 939 x 806

  Without patch:  Avg Interval: 1.37 1.36 1.34 (ms)   StdDev: 0.87 0.86 0.84 (ms)
  With patch:     Avg Interval: 1.36 1.36 1.36 (ms)   StdDev: 0.85 0.82 0.85 (ms)

So it seems neither obviously better nor worse.
Flags: needinfo?(chutten)
Flags: needinfo?(jseward)
Comment on attachment 8757305 [details] [diff] [review]
bug1258183-12-derace.cset

It seems like this is settled.
Attachment #8757305 - Flags: review?(gfritzsche)
Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/dd87e81a3240
TSan: data race toolkit/components/telemetry/Telemetry.cpp in CanRecordBase (part 1, tidying). r=gfritzsche.
https://hg.mozilla.org/integration/mozilla-inbound/rev/f7481a586899
TSan: data race toolkit/components/telemetry/Telemetry.cpp in CanRecordBase (part 2, derace). r=chutten.
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2462c9d05dbd
Backed out changeset f7481a586899
https://hg.mozilla.org/integration/mozilla-inbound/rev/d6070dd805cf
Backed out changeset dd87e81a3240 for kraken crashes
(In reply to Pulsebot from comment #49)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/2462c9d05dbd
> https://hg.mozilla.org/integration/mozilla-inbound/rev/d6070dd805cf
> Backed out changeset dd87e81a3240 for kraken crashes

It deadlocked, exactly as I feared in comment #24:

  call to TelemetryHistogram::someFn
    locks the interface mutex
    calls into JS engine
      JS engine does garbage collection
        JS engine takes some garbage collection lock
          JS engine calls Telemetry to report GC stats
            Telemetry calls TelemetryHistogram
              TelemetryHistogram tries to lock the interface mutex
              deadlock

specifically

  TelemetryHistogram::CreateHistogramSnapshots [TelemetryHistogram.cpp:2122]
  internal_ReflectHistogramSnapshot [TelemetryHistogram.cpp:729]
  JS_NewArrayObject [jsapi.cpp:3307]
  js::NewDenseFullyAllocatedArray [jsarray.cpp:3429]
  js::ArrayObject::createArray [ArrayObject-inl.h:54]
  js::Allocate<JSObject, (js::AllowGC)1u> [Allocator.cpp:85]
  js::gc::GCRuntime::minorGC [jsgc.cpp:8199:33]
  js::gc::GCRuntime::minorGCImpl [jsgc.cpp:6515]
  js::Nursery::collect [Nursery.cpp:540]
  AccumulateTelemetryCallback [XPCJSRuntime.cpp:3254]
  TelemetryHistogram::Accumulate [Mutex.h:164]
  libpthread-2.15.so + 0xe89c
  (deadlocks here and is killed by the hang detector)

I am not sure what to do about this. Maybe it can all be refactored as Georg suggests in comment #38, but that drags the process out even more. On the other hand, I'm not really happy to land this while we're playing whack-a-mole games with deadlocks.
Flags: needinfo?(jseward)
I guess this hang is timing/scheduling dependent. I can't reproduce it locally, using these STR:

  cd testing/talos
  python INSTALL.py
  source bin/activate
  cd ../..
  DISPLAY=:1.0 talos -e \
    /home/sewardj/MOZ/MC-RACE/ff-Og-linux64/dist/bin/firefox \
    --develop --cycles 1 --tppagecycles 1 -a dromaeo_css
  deactivate
(In reply to Julian Seward [:jseward] from comment #50) > It deadlocked, exactly as I feared in comment #24: > > call to TelemetryHistogram::someFn > locks the interface mutex > calls into JS engine ... > > specifically > > TelemetryHistogram::CreateHistogramSnapshots [TelemetryHistogram.cpp:2122] > internal_ReflectHistogramSnapshot [TelemetryHistogram.cpp:729] ... > I am not sure what to do about this. Maybe it can all be refactored as Georg > suggests in comment #38, but that drags the process out even more. On the > other hand I'm not really happy to land this while we're playing whack-a-mole > games with deadlocks. I thought the conclusion here was to not lock around the JS-accessing functions and move the JS-access refactoring to a follow-up bug? I see that there might be some changes required for TelemetryHistogram::CreateHistogramSnapshots() though as it access some racy functions.
(In reply to Georg Fritzsche [:gfritzsche] from comment #52)
> I see that there might be some changes required for
> TelemetryHistogram::CreateHistogramSnapshots() though, as it accesses some
> racy functions.

Or we decide that we can temporarily live with those races. Creating histogram snapshots is an operation that does not happen frequently.
(In reply to Georg Fritzsche [:gfritzsche] from comment #52)
> I thought the conclusion here was to not lock around the JS-accessing
> functions and move the JS-access refactoring to a follow-up bug?

Yes .. my bad. It seems I missed TelemetryHistogram::{Create,GetKeyed,GetAddon}HistogramSnapshots.

(In reply to Georg Fritzsche [:gfritzsche] from comment #53)
> Or we decide that we can temporarily live with those races.
> Creating histogram snapshots is an operation that does not happen
> frequently.

Yes. I am inclined to go for that solution. Fixing the above 3 functions properly is too difficult right now.
Blocks: 1278821
(In reply to Georg Fritzsche [:gfritzsche] from comment #38)
> Can't we avoid deadlocks properly by re-factoring the JS-exposed functions?
> [..]
> Should we move this to a follow-up bug then?

Done, bug 1278821.
Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/fc792a05cc49
TSan: data race toolkit/components/telemetry/Telemetry.cpp in CanRecordBase (part 1, tidying). r=gfritzsche.
https://hg.mozilla.org/integration/mozilla-inbound/rev/1b4582536f10
TSan: data race toolkit/components/telemetry/Telemetry.cpp in CanRecordBase (part 2, derace). r=chutten.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Depends on: 1279614
Depends on: 1350765