Re-factor JS-exposed TelemetryHistogram:: KEYED functions so as to avoid both races and deadlocks

RESOLVED FIXED in Firefox 62

Status

()

enhancement
P1
normal
RESOLVED FIXED
2 years ago
Last year

People

(Reporter: Dexter, Assigned: Dexter, Mentored)

Tracking

Trunk
mozilla62
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox59 wontfix, firefox60 wontfix, firefox61 wontfix, firefox62 fixed)

Details

(Whiteboard: [lang=c++][good-next-bug])

Attachments

(2 attachments, 1 obsolete attachment)

This bug is about changing the keyed histogram in TelemetryHistograms.cpp similarly to how was done in bug 1278821 for plain histograms. See below.

(In reply to Julian Seward [:jseward] from bug 1278821 comment #0)
> Copied from bug 1258183 comment 38:
> 
> Review of attachment 8757305 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/TelemetryHistogram.cpp
> @@ +1070,5 @@
> > +// deadlock because the JS_ calls that they make may call back into the
> > +// TelemetryHistogram interface, hence trying to re-acquire the mutex.
> > +//
> > +// This means that these functions potentially race against threads, but
> > +// that seems preferable to risking deadlock.
> 
> Can't we avoid deadlocks properly by re-factoring the JS-exposed functions?
> 
> In general something like:
> * initial JS calls (arg conversion, ...)
> * lock
> * internal_* calls
> * unlock
> * final JS calls (return value construction, serialization, ...)

At the very list, this bug should be about changing the |GetKeyedHistogramSnapshots| [1] so that we acquire the lock for getting a copy of the internal data and release it before calling any JS-land function.

This bug might require refactoring |GetJSSnapshot| to get a copy of the data without calling JS function and delegate the conversion to JS objects to a separate function. 

[1] - https://searchfox.org/mozilla-central/rev/137f1b2f434346a0c3756ebfcbdbee4069e15dc8/toolkit/components/telemetry/TelemetryHistogram.cpp#2264
[2] - https://searchfox.org/mozilla-central/rev/137f1b2f434346a0c3756ebfcbdbee4069e15dc8/toolkit/components/telemetry/TelemetryHistogram.cpp#903
Whiteboard: [lang=c++][good-next-bug]
Depends on: 1278821
Priority: -- → P3
Fernando, would you be interested in working on this or something slightly less involving?
Flags: needinfo?(mortificador)
(In reply to Alessio Placitelli [:Dexter] from comment #1)
> Fernando, would you be interested in working on this or something slightly
> less involving?

Hi Alessio,

Sure, I can work on this!

Thank you very much! :)
Flags: needinfo?(mortificador)
(In reply to Fernando from comment #3)
> (In reply to Alessio Placitelli [:Dexter] from comment #1)
> > Fernando, would you be interested in working on this or something slightly
> > less involving?
> 
> Hi Alessio,
> 
> Sure, I can work on this!
> 
> Thank you very much! :)

Great! Thank *you* Fernando for all your great work so far!
Assignee: nobody → mortificador
Hi Alessio,

I've been working on this, I've written code to copy the KeyedHistogram maps (there are potentially two, the histogram map and the subsession map), then I do a "shallow clear" of the KeyedHistogram maps (that's it, I call Clear on the maps, but the Histogram pointers won't be deleted, and that's fine because I have copied them to the new maps), and them I use the new maps to make the ReflectIntoJS call.

However this fix does not work, because the method GetJSSnapshot not always clear the keys:
nsresult
KeyedHistogram::GetJSSnapshot(JSContext* cx, JS::Handle<JSObject*> obj,
                              bool subsession, bool clearSubsession)
{
#if !defined(MOZ_WIDGET_ANDROID)
  KeyedHistogramMapType& map = subsession ? mSubsessionMap : mHistogramMap;
#else
  KeyedHistogramMapType& map = mHistogramMap;
#endif
  if (!map.ReflectIntoJS(&KeyedHistogram::ReflectKeyedHistogram, cx, obj)) {
    return NS_ERROR_FAILURE;
  }

#if !defined(MOZ_WIDGET_ANDROID)
  if (subsession && clearSubsession) {
    Clear(true);
  }
#endif

  return NS_OK;
}

But I'm always clearing them, and I think that's the reason why there is a test failing.

I think that a solution would be to copy the Histogram values (not just the pointers, but the values), then call Clear on the keyed histogram (all this with a lock), then release the lock and call the ReflectIntoJS on the map that holds the copied histograms. 
I haven't been very lucky with that, I'm not sure how I can copy the Histogram values (the class has the copy constructor deleted, and I haven't found a clone method or anything similar, there is a FactoryGet but it requires some arguments that I don't know where to get from, like the Sample minimum, Sample maximum, and flags).

Could you help me? If you can point me out how to copy the histogram values, then I think I know how to write the patch.

Thank you very much!
Well, we only want to call clear when:
1) Not Android
2) We're asked for the subsession histograms
3) We're asked to clear the subsession histograms

In all other cases we need to keep the histograms in their maps, unchanged. That's what ReflectIntoJS does: makes a copy of the contents of the appropriate map into `obj`

Am I understanding your question properly?
Hi Chris,

Sorry, I explained myself badly. We don't have to clear always the map (as you said, only when we are not on Android, and both the subsession and clearSubsession flags are true), but I need take the lock, copy the histogram values (not just the pointers, because other threads could mess with that data while we are making the JS calls), make the call to Clear (as said, if we are not on Android, and subsession and clearsubsession are true) then free the mutex, then make JS calls and finally, after the JS calls, I need to clear that map that I've created.

My question is how can I make a copy of the Histogram values.

Thank you very much!
Ack, I drafted a reply and then forgot to actually hit submit :S

RelectIntoJS does the copying into JSValues for you, I thought? (it fills the ranges, the counts, etc.) Do we need an extra copy?
Oh right, Alessio's fixed my POV on IRC. I now understand.

Let me think a moment about how to do the copy and I'll get back to you. Sorry for the noise.
Could we be exceedingly clever by breaking up GetJSSnapshot to no longer, ever, perform the Clear? The caller could call it if and when suitable (and we hold the lock) which should allow your original approach to work, correct?

(Copying of histograms, especially keyed histograms, is something we would like to continue to avoid if possible. If all else fails, we can write ourselves a clone method and call it something scary like ExpensiveCopyForThreadSafetyUseOnly(), but I think we might be able to use brains to get ourselves out of this)
Hi Chris,

Thank you very much for your suggestions!
I think that an approach where we don't do a copy of the Histograms would work only if we are told to clear the map (we are not in Android, and the aSubsession and aSubsessionClear flags are true), but it can give problems otherwise (I think).

If I'm not wrong, if we don't copy the Histograms (when I say "don't copy" I mean that we only copy the pointers, not the values), this is the algorithm:
- Hold the lock, copy the keyed Histogram map to another map (copying only the pointers to Histogram*), and "shallow clear" the keyed map (what I call a shallow clear is just call Clear on the map, but without deleting the pointers). This "shallow clear" is to avoid other threads messing with the data while we make the JS calls. Next to the maps, save also the KeyedHistogram* that owns the map that we just copyied (I explain the reason for this in the last step)
- Free the lock, make the JS calls
- Hold the lock again and:
  - If we have to clear (we are not on Android and aSubsession and aClearSubsession are true), delete the Histogram* from our    maps (basically, what the KeyedHistogram::Clear(true) does). This is all safe
  - If we don't have to clear the maps, then we have to copy the Histogram* back to the keyed maps (that's why we saved the KeyedHistogram* in one of the previous steps). The risk here is that the KeyedHistogram* are not valid anymore (that's why I think this approach can give problems).

I'm not sure if what I said is true or I'm missing something, if you are fine with this approach I can write a patch implementing this algorithm.

Thank you very much!
Hi Chris, Alessio,

I've drafted a patch that implements the algorithm described in my previous comment. It's just a draft, I know that some parts don't follow the mozilla code conventions, but if you are happy with the general lines of the patch, I'll fix all that.
This patch passes all the tests (mach test toolkit/components/telemetry).

Thank you very much!
Posted patch bug1430531_draft.patch (obsolete) — Splinter Review
Attachment #8946053 - Flags: review?(chutten)
Attachment #8946053 - Flags: review?(alessio.placitelli)
Comment on attachment 8946053 [details] [diff] [review]
bug1430531_draft.patch

Review of attachment 8946053 [details] [diff] [review]:
-----------------------------------------------------------------

So for the duration of the reflect, other threads will find the keyed histogram in an inconsistent state. The other thread could accumulate to that keyed histogram and that accumulation would be lost on the copy-back.

This sounds pretty bad :( I'm okay with reporting stale data, but we shouldn't destroy any of it while we go.

I think we're going to need a copy ctor after all, just to keep this manageable. Then we're back to:
1) Take the lock
2) Copy the data to locals
2.i) Clear the data if we're supposed to
3) Release the lock
4) Do the JS stuff

It's possible that the keyed histogram we return after #4 will be a little stale (a thread could come in after 3 and accumulate N values to the histogram we just copied), but that isn't really any worse than what we currently have.

Does this make sense?
Attachment #8946053 - Flags: review?(chutten)
Attachment #8946053 - Flags: review?(alessio.placitelli)
Attachment #8946053 - Flags: review-
Comment on attachment 8946053 [details] [diff] [review]
bug1430531_draft.patch

Clearing that off! Fernando, Chris has kindly agreed to mentor you on this bug, as I'm a bit swamped these days.
Attachment #8946053 - Flags: review-
Comment on attachment 8946053 [details] [diff] [review]
bug1430531_draft.patch

Sorry, I didn't mean to clear Chris' r- :)
Attachment #8946053 - Flags: review-
(In reply to Alessio Placitelli [:Dexter] from comment #15)
> Comment on attachment 8946053 [details] [diff] [review]
> bug1430531_draft.patch
> 
> Clearing that off! Fernando, Chris has kindly agreed to mentor you on this
> bug, as I'm a bit swamped these days.

No problem! Thank you :)

(In reply to Chris H-C :chutten from comment #14)
> Comment on attachment 8946053 [details] [diff] [review]
> bug1430531_draft.patch
> 
> Review of attachment 8946053 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> So for the duration of the reflect, other threads will find the keyed
> histogram in an inconsistent state. The other thread could accumulate to
> that keyed histogram and that accumulation would be lost on the copy-back.
> 
> This sounds pretty bad :( I'm okay with reporting stale data, but we
> shouldn't destroy any of it while we go.
> 
> I think we're going to need a copy ctor after all, just to keep this
> manageable. Then we're back to:
> 1) Take the lock
> 2) Copy the data to locals
> 2.i) Clear the data if we're supposed to
> 3) Release the lock
> 4) Do the JS stuff
> 
> It's possible that the keyed histogram we return after #4 will be a little
> stale (a thread could come in after 3 and accumulate N values to the
> histogram we just copied), but that isn't really any worse than what we
> currently have.
> 
> Does this make sense?

Hi Chris,

You are completely right, I didn't think of that case (other thread accumulating into a keyed histogram and that accumulation getting lost).
We can either copy the data (deep copy) as you suggested, or try to be a bit more clever when copying the data back: first check if the keyed map is empty or not, and if it's not then *add* the value that we saved. I'm just giving the idea, I'm not sure if that is something even possible with the current API.

In the case that we copy the data, how can I do it? If I'm not wrong the histogram has the copy ctor disabled.

Thank you very much!
Code to merge two sample sets like that would probably be hard to read and maintain. Let's go back to your original idea and copy the data we need.

Like the Histogram snapshotting code shows, the only data we actually need to copy and keep local is the sampleset. This is going to be a little more fiddly and involved than originally advertised. Would you like to continue with this bug now that it seems to be becoming somewhat more complicated?

If so, it looks like we'll be rewriting the keyed histogram snapshotting code to operate on the samplesets we snapshot while we hold the lock, instead of passing the JSContext down function by function.
(In reply to Chris H-C :chutten from comment #18)
> Code to merge two sample sets like that would probably be hard to read and
> maintain. Let's go back to your original idea and copy the data we need.
> 
> Like the Histogram snapshotting code shows, the only data we actually need
> to copy and keep local is the sampleset. This is going to be a little more
> fiddly and involved than originally advertised. Would you like to continue
> with this bug now that it seems to be becoming somewhat more complicated?
> 
> If so, it looks like we'll be rewriting the keyed histogram snapshotting
> code to operate on the samplesets we snapshot while we hold the lock,
> instead of passing the JSContext down function by function.

Hi Chris,

Sure, I still want to help, I will need some guide though.

Then, what I need to copy is the histograms samplesets, right? I guess that I'll have to make modifications to the KeyedHistogram class, but I'm not sure what changes (basically I don't know how to make what the call map.ReflectIntoJS does if what I have is a map of samplesets)

Thank you very much!
All ReflectIntoJS does[1] is iterate over the map and call KeyedHistogram::ReflectKeyedHistogram on the keys in the map.

KeyedHistogram::ReflectKeyedHistogram ensures the key is put into the JS object return value, pointing to the standard histogram snapshot made by internal_ReflectHistogramSnapshot (which clones out a sampleset and calls the same internal_ReflectHistogramAndSamples you saw before in bug 1278821)

So here we want to gain the lock then iterate over the map and make copies of key strings and samplesets into our own map. After that we can reuse ReflectIntoJS on this new map, but change KeyedHistogram::ReflectKeyedHistogram to some other function that takes our copied-key-string->copied-sampleset map entries instead of the key-string->Histogram* entries.

That ought to do it, I guess.

...oh, I guess we still need a Histogram* kicking around to give us the buckets and whatnot. We'll need to find some way of passing that information around. Actually... how did you handle someone changing the Histogram* out from under you between snapshotting the sample and reflecting the min,max,sum,etc. into JS? It looks to me as though some thread could change, say, the min between [1] and [2] so that the data reported to JS will be inconsistent.

[1]: https://searchfox.org/mozilla-central/rev/c56f656febb1aa58c72b218699e24a76425bd284/toolkit/components/telemetry/TelemetryHistogram.cpp#1225
[2]: https://searchfox.org/mozilla-central/rev/c56f656febb1aa58c72b218699e24a76425bd284/toolkit/components/telemetry/TelemetryHistogram.cpp#1233
(In reply to Chris H-C :chutten from comment #20)
> All ReflectIntoJS does[1] is iterate over the map and call
> KeyedHistogram::ReflectKeyedHistogram on the keys in the map.
> 
> KeyedHistogram::ReflectKeyedHistogram ensures the key is put into the JS
> object return value, pointing to the standard histogram snapshot made by
> internal_ReflectHistogramSnapshot (which clones out a sampleset and calls
> the same internal_ReflectHistogramAndSamples you saw before in bug 1278821)
> 
> So here we want to gain the lock then iterate over the map and make copies
> of key strings and samplesets into our own map. After that we can reuse
> ReflectIntoJS on this new map, but change
> KeyedHistogram::ReflectKeyedHistogram to some other function that takes our
> copied-key-string->copied-sampleset map entries instead of the
> key-string->Histogram* entries.
> 
> That ought to do it, I guess.
> 
> ...oh, I guess we still need a Histogram* kicking around to give us the
> buckets and whatnot. We'll need to find some way of passing that information
> around. Actually... how did you handle someone changing the Histogram* out
> from under you between snapshotting the sample and reflecting the
> min,max,sum,etc. into JS? It looks to me as though some thread could change,
> say, the min between [1] and [2] so that the data reported to JS will be
> inconsistent.
> 
> [1]:
> https://searchfox.org/mozilla-central/rev/
> c56f656febb1aa58c72b218699e24a76425bd284/toolkit/components/telemetry/
> TelemetryHistogram.cpp#1225
> [2]:
> https://searchfox.org/mozilla-central/rev/
> c56f656febb1aa58c72b218699e24a76425bd284/toolkit/components/telemetry/
> TelemetryHistogram.cpp#1233

Hi Chris,

Now I understand better, thank you very much for the explanation.
If I'm not wrong, that would be the algorithm:
1- Take the lock
2- Copy the key/histogram data (the sample set) to our own map
3- Release the lock
4- Do with our maps what ReflectIntoJS does

I think you are right, if someone changes the data after I copied it but before I reflect it into JS, we'll show inconsistent data, but I'm not sure that we can avoid that situation without holding locks during the JS calls.

I'll work on a patch implementing the algorithm described, thank you very much!
I have filed bug 1435812 for fixing up the non-keyed case so we make sure to copy all the mutable data before we snapshot it.
(In reply to Chris H-C :chutten from comment #23)
> I have filed bug 1435812 for fixing up the non-keyed case so we make sure to
> copy all the mutable data before we snapshot it.

Hi Chris,

Sorry for the late update, something came up and I'm not going to be able to work on this for a while, so I think it'd be for the best to unassign me from this bug, and if nobody takes it, I can come back and work on it when I have more time.

Thank you very much!
Sure, no problem!
Assignee: mortificador → nobody
Hey Chris, assign this to me, I'll give it a shot.  Haven't written any C++ code since 2002, but I like a challenge.
Sure. Let me know if you have any questions.
Assignee: nobody → joberts.ff
Status: NEW → ASSIGNED
Blocks: 1457127
Priority: P3 → P1
Chris, 

First, I just want to say thanks for helping me get going with contributing and with my first 2 bugs.

I'm going to review bug 1278821, since it's similar. Besides that is there any other information/documentation that I can review that might be helpful with this bug, or telemetry in general?

Thanks,
If you're familiar with multithreading and locks, this is a matter of refactoring these methods to protect the critical sections with a lock. Right now the accesses to unprotected (internal_*) methods are not protected when accessed via JS (because JS itself can re-enter these methods whenever it likes), so we need to reorder the operations (and likely introduce copies) to render these methods threadsafe.

If you aren't familiar with multithreading and locks, then this is probably not the best bug in which to become familiar with them :) In that case I would rather strongly recommend avoiding this particular old mess and maybe look into something like bug 1451813 which offers a better introduction to Telemetry that isn't clouded by a bunch of re-entrant code.
Sounds like a plan unassign me from this and assign me to bug 1451813. Thanks
Assignee: joberts.ff → nobody
Status: ASSIGNED → NEW
Assignee: nobody → alessio.placitelli
Attachment #8946053 - Attachment is obsolete: true
Comment on attachment 8974959 [details]
Bug 1430531 - Refactor histogram code to allow for JS-free snapshotting.

https://reviewboard.mozilla.org/r/243350/#review249288

Some small things, but the approach seems solid.

::: commit-message-b52b2:5
(Diff revision 1)
> +Bug 1430531 - Refactor histogram code to allow for JS-free snapshotting. r?chutten
> +
> +This patch introduces a couple of new functions to copy histogram data
> +to Mozilla-friendly arrays. This solves the problem of passing Histogram
> +pointers around and makes working away of JS functions easier.

*away from

::: toolkit/components/telemetry/TelemetryHistogram.cpp:702
(Diff revision 1)
> +    return NS_ERROR_FAILURE;
> +  }
> +
> +  // Fill the "ranges" property.
> +  for (size_t i = 0; i < count; i++) {
> +    if (!JS_DefineElement(cx, rarray, i, aSnapshot.mBucketRanges[i], JSPROP_ENUMERATE))

Need some braces on this

::: toolkit/components/telemetry/TelemetryHistogram.cpp:2238
(Diff revision 1)
>    // Struct used to keep information about the histograms for which a
>    // snapshot should be created
> -  struct MOZ_NON_MEMMOVABLE HistogramProcessInfo {
> -    Histogram* h;
> +  struct HistogramProcessInfo {
> +    HistogramSnapshotData data;
> -    Histogram::SampleSet ss;
>      size_t index;

It's a histogram id. We can fix this name while we're in the neighbourhood.
Attachment #8974959 - Flags: review?(chutten) → review+
Comment on attachment 8974960 [details]
Bug 1430531 - Refactor JS exposed keyed snapshotting to avoid races and deadlocks.

https://reviewboard.mozilla.org/r/243352/#review249292

Some notes

::: toolkit/components/telemetry/TelemetryHistogram.cpp:1514
(Diff revision 1)
> +
> +    // Get data for the key we're looking for.
> -  Histogram* h = nullptr;
> +    Histogram* h = nullptr;
> -  nsresult rv = keyed->GetHistogram(NS_ConvertUTF16toUTF8(key), &h);
> +    nsresult rv = keyed->GetHistogram(NS_ConvertUTF16toUTF8(key), &h);
> -  if (NS_FAILED(rv)) {
> +    if (NS_FAILED(rv)) {
> -    JS_ReportErrorASCII(cx, "Failed to get histogram");
> +      JS_ReportErrorASCII(cx, "Failed to get histogram");

Error reporting through JS may cause an operation that accumulates Telemetry, no?

::: toolkit/components/telemetry/TelemetryHistogram.cpp:2346
(Diff revision 1)
>      includeGPUProcess = gpm->AttemptedGPUProcess();
>    }
>  
> -  for (uint32_t process = 0; process < static_cast<uint32_t>(ProcessID::Count); ++process) {
> -    JS::Rooted<JSObject*> processObject(aCx, JS_NewPlainObject(aCx));
> -    if (!processObject) {
> +  struct KeyedHistogramProcessInfo {
> +    KeyedHistogramSnapshotData data;
> +    size_t index;

This is a histogram id, right? We can be more clear about that. (Do we want the type protection of Telemetry::HistogramID?)
Attachment #8974960 - Flags: review?(chutten) → review+
Comment on attachment 8974960 [details]
Bug 1430531 - Refactor JS exposed keyed snapshotting to avoid races and deadlocks.

https://reviewboard.mozilla.org/r/243352/#review249292

> Error reporting through JS may cause an operation that accumulates Telemetry, no?

Oh my, right, stupid error on my end :)
Blocks: 1461262
Pushed by alessio.placitelli@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/4589edd72ebb
Refactor histogram code to allow for JS-free snapshotting. r=chutten
https://hg.mozilla.org/integration/autoland/rev/bd328282d668
Refactor JS exposed keyed snapshotting to avoid races and deadlocks. r=chutten
https://hg.mozilla.org/mozilla-central/rev/4589edd72ebb
https://hg.mozilla.org/mozilla-central/rev/bd328282d668
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
You need to log in before you can comment on or make changes to this bug.