Duplicate telemetry histograms for double submission

RESOLVED FIXED in Firefox 39

Status

()

Toolkit
Telemetry
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: gfritzsche, Assigned: gfritzsche)

Tracking

Trunk
mozilla39
Points:
---

Firefox Tracking Flags

(firefox39 fixed)

Details

(Whiteboard: [ready])

Attachments

(3 attachments, 8 obsolete attachments)

15.99 KB, patch
gfritzsche
: review+
Details | Diff | Splinter Review
13.30 KB, patch
gfritzsche
: review+
Details | Diff | Splinter Review
15.11 KB, patch
gfritzsche
: review+
Details | Diff | Splinter Review
(Assignee)

Description

3 years ago
Per the discussion on the FHR/Telemetry unification concerns [0] we want to go with client-side submission of both "reset per subsession" and "old style telemetry" histograms.

This means that we have to duplicate histograms.
Bug 1126867 showed that this has acceptable memory costs.

0:"Suggestions for the new unified FHR/Telemetry/Experiment ping", https://mail.mozilla.org/pipermail/fhr-dev/2015-January/thread.html#382
(Assignee)

Updated

3 years ago
Assignee: nobody → gfritzsche
(Assignee)

Updated

3 years ago
Blocks: 1127919
(Assignee)

Comment 1

3 years ago
Created attachment 8558148 [details] [diff] [review]
Part 1: Duplicate keyed histograms

This duplicates keyed histogram data and allows to reset and retrieve the duplicate separately.

Notes:
* I'm keeping this to the minimum required for handling duplicates in TelemetrySession.jsm
* I don't intend to document the duplicate features as this is only for a temporary telemetry-internal use-case
Attachment #8558148 - Flags: review?(vdjeric)
(Assignee)

Updated

3 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 2

3 years ago
Created attachment 8558159 [details] [diff] [review]
Part 1: Duplicate keyed histograms

Fixed minor oversight.
Attachment #8558148 - Attachment is obsolete: true
Attachment #8558148 - Flags: review?(vdjeric)
Attachment #8558159 - Flags: review?(vdjeric)
(Assignee)

Comment 3

3 years ago
Created attachment 8558646 [details] [diff] [review]
Part 2: Duplicate normal histograms
Attachment #8558646 - Flags: review?(vdjeric)
(Assignee)

Comment 4

3 years ago
Created attachment 8559251 [details] [diff] [review]
Part 3: Submit duplicate histograms for non-classic payload reasons

This still has some test issues, but gives context on the usage of part 1 & 2.
Comment on attachment 8558646 [details] [diff] [review]
Part 2: Duplicate normal histograms

Review of attachment 8558646 [details] [diff] [review]:
-----------------------------------------------------------------

Given that we expect subsession histograms to eventually completely replace the full-session histograms, shouldn't we make the full-session histograms be the "duplicates"?

::: toolkit/components/telemetry/Telemetry.cpp
@@ +993,5 @@
>  }
>  
> +Histogram*
> +CloneHistogram(const nsACString& newName, Telemetry::ID existingId,
> +               Histogram& existing)

- add a comment header for these functions explaining what they do & the motivation

@@ +995,5 @@
> +Histogram*
> +CloneHistogram(const nsACString& newName, Telemetry::ID existingId,
> +               Histogram& existing)
> +{
> +  const TelemetryHistogram &p = gHistograms[existingId];

- I'd prefer a check for nullness of gHistograms[existingId] here, rather than in this function's callers
- pick a more descriptive name than p

@@ +1009,5 @@
> +  }
> +
> +  Histogram::SampleSet ss;
> +  existing.SnapshotSample(&ss);
> +  clone->AddSampleSet(ss);

There's a chance of a race condition here, SnapshotSample() is not protected with any locks!

http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/histogram.cc#389

This could cause the sub-session histogram to miss samples added to the full-session histogram.
This wasn't a big deal before because HistogramFrom was only used to create snapshots of histograms at startup. It gets more sensitive for the new use case

@@ +1019,5 @@
> +CloneHistogram(const nsACString& newName, Telemetry::ID existingId)
> +{
> +  Histogram *existing = nullptr;
> +  nsresult rv = GetHistogramByEnumId(existingId, &existing);
> +  if (NS_FAILED(rv)) {

why not put this check into the other CloneHistogram function? That way we'll know the check is done even if some future code calls the 3-parameter version of CloneHistogram directly

@@ +1027,5 @@
> +  return CloneHistogram(newName, existingId, *existing);
> +}
> +
> +Histogram*
> +GetDuplicateHistogram(Histogram& existing)

same comment about the "duplicate" name

@@ +1057,5 @@
> +nsresult
> +HistogramAdd(Histogram& histogram, int32_t value)
> +{
> +  histogram.Add(value);
> +  if (Histogram* duplicate = GetDuplicateHistogram(histogram)) {

I feel like something very visible should happen if we're not able to create a duplicate histogram (e.g. an assert)

@@ +2194,5 @@
> +    Histogram* original = h;
> +    if (duplicates) {
> +      h = GetDuplicateHistogram(*h);
> +      if (!h) {
> +        continue;

why would we want to continue here?

::: toolkit/components/telemetry/nsITelemetry.idl
@@ +58,5 @@
>    /**
> +   * As histogramSnapshots, except this contains the internally duplicated histograms for subsession telemetry.
> +   */
> +  [implicit_jscontext]
> +  readonly attribute jsval duplicateHistogramSnapshots;

Duplicate is ambiguous, especially since it's not clear which is which. Let's use another name, maybe "subsessionHistogramSnapshots" or "dailyHistogramSnapshots".

::: toolkit/components/telemetry/tests/unit/test_nsITelemetry.js
@@ +604,5 @@
>    Assert.ok(!registered.has("TELEMETRY_TEST_KEYED_FLAG"));
>    Assert.ok(registered.has("TELEMETRY_TEST_KEYED_RELEASE_OPTOUT"));
>  }
>  
> +function test_duplicates() {

Add another test for calling clear() when there's data in both snapshots

@@ +630,5 @@
> +  h.clear(true);
> +  h.add(1);
> +  snapshot = Telemetry.histogramSnapshots;
> +  duplicates = Telemetry.duplicateHistogramSnapshots;
> +  Assert.deepEqual(original[ID], duplicates[ID]);

this is a little hard to parse.. why not just check for value of the histogram's .sum field directly?

@@ +637,5 @@
> +  // Increasing the count again means the duplicate should now match the second snapshot.
> +  h.add(1);
> +  snapshot = Telemetry.histogramSnapshots;
> +  duplicates = Telemetry.duplicateHistogramSnapshots;
> +  Assert.deepEqual(original2[ID], duplicates[ID]);

ditto
Attachment #8558646 - Flags: review?(vdjeric)
Comment on attachment 8558159 [details] [diff] [review]
Part 1: Duplicate keyed histograms

Review of attachment 8558159 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +3877,5 @@
> +nsresult
> +KeyedHistogram::Add(const nsCString& key, uint32_t sample)
> +{
> +  if (!TelemetryImpl::CanRecord()) {
> +    return NS_ERROR_FAILURE;

doesn't this mean we're going to report a JS error and throw an exception(?) if Telemetry is turned off?

@@ +3883,5 @@
> +
> +  Histogram* histogram = GetHistogram(key, false);
> +  Histogram* duplicate = GetHistogram(key, true);
> +  if (!histogram || !duplicate) {
> +    return NS_ERROR_FAILURE;

in general, i don't like throwing exceptions from Telemetry code.. they're nevercaught by the callers and they're hard to spot when it starts happening. Asserts (or segfaults for OOMs) would be preferable
Attachment #8558159 - Flags: review?(vdjeric)
(Assignee)

Comment 7

3 years ago
(In reply to Vladan Djeric (:vladan) from comment #5)
> Given that we expect subsession histograms to eventually completely replace
> the full-session histograms, shouldn't we make the full-session histograms
> be the "duplicates"?

I don't think that matters much, we have to remove one code-path either way and only submit/reset one of them.

> @@ +1009,5 @@
> > +  }
> > +
> > +  Histogram::SampleSet ss;
> > +  existing.SnapshotSample(&ss);
> > +  clone->AddSampleSet(ss);
> 
> There's a chance of a race condition here, SnapshotSample() is not protected
> with any locks!
> 
> http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/
> histogram.cc#389
> 
> This could cause the sub-session histogram to miss samples added to the
> full-session histogram.
> This wasn't a big deal before because HistogramFrom was only used to create
> snapshots of histograms at startup. It gets more sensitive for the new use
> case

I vaguely remember that histograms and the telemetry API should only be used from the same thread. Is that not the case anymore?
I'm asking because the cloning i added is only triggered from the histogram accumulate/add code paths.

If we need this, where would you prefer to protect this? Synchronizing Telemetry call paths or on the histograms themselves?


> @@ +1019,5 @@
> > +CloneHistogram(const nsACString& newName, Telemetry::ID existingId)
> > +{
> > +  Histogram *existing = nullptr;
> > +  nsresult rv = GetHistogramByEnumId(existingId, &existing);
> > +  if (NS_FAILED(rv)) {
> 
> why not put this check into the other CloneHistogram function? That way
> we'll know the check is done even if some future code calls the 3-parameter
> version of CloneHistogram directly

I did not want to call GetHistogramByEnumId() unnecessarily - one caller / call path already has the histogram, so the lookup is not needed.

> @@ +2194,5 @@
> > +    Histogram* original = h;
> > +    if (duplicates) {
> > +      h = GetDuplicateHistogram(*h);
> > +      if (!h) {
> > +        continue;
> 
> why would we want to continue here?

Because keyed histograms follow a different code path and we don't get duplicates for the existing duplicate histograms - GetDuplicateHistogram() fails for them.
(Assignee)

Comment 8

3 years ago
needinfo? on the open questions in comment 7.
Flags: needinfo?(vdjeric)
(Assignee)

Comment 9

3 years ago
(In reply to Vladan Djeric (:vladan) from comment #5)
> @@ +1057,5 @@
> > +nsresult
> > +HistogramAdd(Histogram& histogram, int32_t value)
> > +{
> > +  histogram.Add(value);
> > +  if (Histogram* duplicate = GetDuplicateHistogram(histogram)) {
> 
> I feel like something very visible should happen if we're not able to create
> a duplicate histogram (e.g. an assert)

For simplicity the additions are limited to registered histograms.
For all other histograms, GetDuplicateHistogram() (or now GetSubsessionHistogram()), returns null, which is not a failure.
We should handle failure cases in GetSubsessionHistogram().
(Assignee)

Comment 10

3 years ago
Created attachment 8562368 [details] [diff] [review]
Part 1: Duplicate keyed histograms

Addressed feedback except the lock protection - i would handle that in a separate patch depending on the answer to the open question.
Attachment #8558159 - Attachment is obsolete: true
Attachment #8562368 - Flags: review?(vdjeric)
(Assignee)

Comment 11

3 years ago
Created attachment 8562373 [details] [diff] [review]
Part 2: Duplicate normal histograms

Addressed feedback except the lock protection - i would handle that in a separate patch depending on the answer to the open question.
Attachment #8558646 - Attachment is obsolete: true
Attachment #8562373 - Flags: review?(vdjeric)
(Assignee)

Comment 12

3 years ago
Created attachment 8562426 [details] [diff] [review]
Part 3: Submit duplicate histograms for non-classic payload reasons

This takes care of submitting the duplicates for all session reasons other than the "classic"/"full-session" ones.
Attachment #8559251 - Attachment is obsolete: true
Attachment #8562426 - Flags: review?(vdjeric)
(Assignee)

Updated

3 years ago
No longer blocks: 1127919
(Assignee)

Updated

3 years ago
Blocks: 1127919
No longer blocks: 1069869
> I vaguely remember that histograms and the telemetry API should only be used
> from the same thread. Is that not the case anymore?

That's not correct.

Some of the Telemetry.idl methods get called from other threads and there are some C++ methods exposed via Telemetry.h that access the same data from multiple threads (e.g. RecordSlowSQLStatement). 

The histograms can be accumulated from a not-main thread (e.g. COMPOSITE_TIME) or even multiple threads (e.g. MOZ_SQLITE_PLACES_READ_MS). These accumulations are not synchronized!

The thinking is that Telemetry doesn't have to be 100% accurate because we only look at the results in aggregate, and we definitely don't want to pay the cost of synchronizing, we just need to make sure there's no data corruption.

> I'm asking because the cloning i added is only triggered from the histogram
> accumulate/add code paths.

Accumulate is not synchronized, so cloning could race with accumulate.

> If we need this, where would you prefer to protect this? Synchronizing
> Telemetry call paths or on the histograms themselves?

On second thought, I think this is just something we might have to tolerate. Let's document Telemetry behavior wrt threads in your in-tree doc.

> > @@ +1019,5 @@
> > > +CloneHistogram(const nsACString& newName, Telemetry::ID existingId)
> > > +{
> > > +  Histogram *existing = nullptr;
> > > +  nsresult rv = GetHistogramByEnumId(existingId, &existing);
> > > +  if (NS_FAILED(rv)) {
> > 
> > why not put this check into the other CloneHistogram function? That way
> > we'll know the check is done even if some future code calls the 3-parameter
> > version of CloneHistogram directly
> 
> I did not want to call GetHistogramByEnumId() unnecessarily - one caller /
> call path already has the histogram, so the lookup is not needed.

You could only call GetHistogramByEnumId() in the 3-parameter version of CloneHistogram, but only if the ID is some known invalid value. But that would be messier cause the IDs are an auto-generated uint32_t enum. So nevermind :)

> > @@ +2194,5 @@
> > > +    Histogram* original = h;
> > > +    if (duplicates) {
> > > +      h = GetDuplicateHistogram(*h);
> > > +      if (!h) {
> > > +        continue;
> > 
> > why would we want to continue here?
> 
> Because keyed histograms follow a different code path and we don't get
> duplicates for the existing duplicate histograms - GetDuplicateHistogram()
> fails for them.

Oh I see, GetDuplicateHistogram returns a nullptr if the histogram is keyed or if there's something really wrong with Telemetry (e.g. histogram ID not found).
Flags: needinfo?(vdjeric)
Comment on attachment 8562373 [details] [diff] [review]
Part 2: Duplicate normal histograms

Review of attachment 8562373 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/Telemetry.cpp
@@ +993,5 @@
>  }
>  
> +// This clones a histogram |existing| with the id |existingId| to a
> +// new histogram with the name |newName|.
> +// For simplicity this is limited to registered histograms.

nit: use the /** .. */ style

@@ +1036,5 @@
> +Histogram*
> +GetSubsessionHistogram(Histogram& existing)
> +{
> +  Telemetry::ID id;
> +  nsresult rv = TelemetryImpl::GetHistogramEnumId(existing.histogram_name().c_str(), &id);

existing.id() has the histogram ID

@@ +1236,5 @@
> +  bool onlySubsession = false;
> +  JS::CallArgs args = JS::CallArgsFromVp(argc, vp);
> +
> +  if (args.length() >= 1) {
> +    if (!(args[0].isNumber() || args[0].isBoolean())) {

do we really want to allow number arguments here?

@@ +1248,4 @@
>    Histogram *h = static_cast<Histogram*>(JS_GetPrivate(obj));
> +  MOZ_ASSERT(h);
> +  if(!onlySubsession) {
> +    h->Clear();

it likely won't matter, but clear() is going to be racey too
Attachment #8562373 - Flags: review?(vdjeric) → review+
Comment on attachment 8562368 [details] [diff] [review]
Part 1: Duplicate keyed histograms

Review of attachment 8562368 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/tests/unit/test_nsITelemetry.js
@@ +610,5 @@
> +  const KEY = "foo";
> +
> +  // Both original and subsession should start out the same.
> +  h.clear();
> +  Assert.ok(!("foo" in h.snapshot()));

use KEY in these asserts instead of "foo"
Attachment #8562368 - Flags: review?(vdjeric) → review+
Comment on attachment 8562426 [details] [diff] [review]
Part 3: Submit duplicate histograms for non-classic payload reasons

Review of attachment 8562426 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +792,5 @@
>     * to |this.getSimpleMeasurements| and |this.getMetadata|,
>     * respectively.
>     */
> +  assemblePayloadWithMeasurements: function(simpleMeasurements, info, reason) {
> +    const classicReasons = [

this is pretty ugly, but I see how it allows you to make fewer changes.

@@ +807,5 @@
>      let payloadObj = {
>        ver: PAYLOAD_VERSION,
>        simpleMeasurements: simpleMeasurements,
> +      histograms: this.getHistograms(isSubsession),
> +      keyedHistograms: this.getKeyedHistograms(isSubsession),

which code is going to call clear() on the subsession histograms?

::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
@@ +615,5 @@
> +  // Both classic and subsession payload histograms should start the same.
> +  count.clear();
> +  keyed.clear();
> +  let classic = TelemetrySession.getPayload();
> +  let subsession = TelemetrySession.getPayload("environment-change");

might want to explicitly test getPayload("saved-session") == getPayload()

@@ +641,5 @@
> +  Assert.ok(KEYED_ID in subsession.keyedHistograms);
> +  Assert.equal(classic.keyedHistograms[KEYED_ID]["a"].sum, 1);
> +  Assert.equal(classic.keyedHistograms[KEYED_ID]["b"].sum, 1);
> +  Assert.equal(subsession.keyedHistograms[KEYED_ID]["a"].sum, 1);
> +  Assert.equal(subsession.keyedHistograms[KEYED_ID]["b"].sum, 1);

Couldn't you just verify classic has the values you want, and then just Assert.deepEqual(classic, subsession)?

@@ +654,5 @@
> +  Assert.ok(!(COUNT_ID in subsession.histograms));
> +  Assert.ok(KEYED_ID in classic.keyedHistograms);
> +  Assert.ok(KEYED_ID in subsession.keyedHistograms);
> +  Assert.deepEqual(classic.keyedHistograms[KEYED_ID], {});
> +  Assert.deepEqual(subsession.keyedHistograms[KEYED_ID], {});

ditto
Attachment #8562426 - Flags: review?(vdjeric) → review+
There's a lot of changes here, so make sure to do an end-to-end test. You can use this dummy Telemetry server, instructions in README:

https://github.com/vdjeric/gzipServer

Comment 18

3 years ago
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #17)
> There's a lot of changes here, so make sure to do an end-to-end test. You
> can use this dummy Telemetry server, instructions in README:
> 
> https://github.com/vdjeric/gzipServer

You can also run the real Telemetry server locally - instructions: https://github.com/mozilla/telemetry-server#httpserverjs
(Assignee)

Comment 19

3 years ago
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #16)
> @@ +807,5 @@
> >      let payloadObj = {
> >        ver: PAYLOAD_VERSION,
> >        simpleMeasurements: simpleMeasurements,
> > +      histograms: this.getHistograms(isSubsession),
> > +      keyedHistograms: this.getKeyedHistograms(isSubsession),
> 
> which code is going to call clear() on the subsession histograms?

This will happen in bug 1120362, with the session split.

> ::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
> @@ +615,5 @@
> > +  // Both classic and subsession payload histograms should start the same.
> > +  count.clear();
> > +  keyed.clear();
> > +  let classic = TelemetrySession.getPayload();
> > +  let subsession = TelemetrySession.getPayload("environment-change");
> 
> might want to explicitly test getPayload("saved-session") == getPayload()

Between collecting the two payloads, a number of histograms and other values may actually get updated (GC_REASON2, uptime, ...).
Also, we only duplicate registered histograms into the subsession.

We can't do a deep equality test, but i've added:
* verifying that we have the registered histograms in the subsession payload
* deep equality between subsession and classic for specific known-to-be-stable histograms

> @@ +641,5 @@
> > +  Assert.ok(KEYED_ID in subsession.keyedHistograms);
> > +  Assert.equal(classic.keyedHistograms[KEYED_ID]["a"].sum, 1);
> > +  Assert.equal(classic.keyedHistograms[KEYED_ID]["b"].sum, 1);
> > +  Assert.equal(subsession.keyedHistograms[KEYED_ID]["a"].sum, 1);
> > +  Assert.equal(subsession.keyedHistograms[KEYED_ID]["b"].sum, 1);
> 
> Couldn't you just verify classic has the values you want, and then just
> Assert.deepEqual(classic, subsession)?

See above.
(Assignee)

Comment 20

3 years ago
Created attachment 8564286 [details] [diff] [review]
Part 1: Duplicate keyed histograms
Attachment #8562368 - Attachment is obsolete: true
Attachment #8564286 - Flags: review+
(Assignee)

Comment 21

3 years ago
Created attachment 8564287 [details] [diff] [review]
Part 2: Duplicate normal histograms
(Assignee)

Updated

3 years ago
Attachment #8564287 - Flags: review+
(Assignee)

Updated

3 years ago
Attachment #8562373 - Attachment is obsolete: true
(Assignee)

Comment 22

3 years ago
Created attachment 8564288 [details] [diff] [review]
Part 3: Submit duplicate histograms for non-classic payload reasons
Attachment #8562426 - Attachment is obsolete: true
Attachment #8564288 - Flags: review+
> > which code is going to call clear() on the subsession histograms?
> 
> This will happen in bug 1120362, with the session split.

Ok, you'll want the histogram clear() to happen as close as possible to the snapshotting.. the more time elapses between capturing the snapshot and resetting the histograms, the more data gets lost.  so  for example you wouldn't want to have any yields from a Task in between the snapshot & clear
.. a snapshotAndClear method would be ideal
(Assignee)

Comment 25

3 years ago
Good point, added moved that to bug 1120362.
(Assignee)

Comment 26

3 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d1b78819f79f
https://tbpl.mozilla.org/?tree=Try&rev=d1b78819f79f
(Assignee)

Comment 27

3 years ago
Ok, the only failures seem to be hitting MOZ_ASSERT on Android:
03:26:17     INFO -  02-14 03:26:01.593 F/MOZ_Assert( 2258): Assertion failure: this->mRecursionLevel > 0, at /builds/slave/try-and-api-11-d-0000000000000/build/xpcom/glue/pldhash.cpp:567

https://tbpl.mozilla.org/php/getParsedLog.php?id=59037148&tree=Try&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=59037188&tree=Try&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=59037236&tree=Try&full=1

So, we should not hit the duplicate code on non-desktop anyway, i'll address that.
We should still understand the issue though in case this is not Android-specific.
(Assignee)

Updated

3 years ago
Blocks: 1069869
No longer blocks: 1127919
(Assignee)

Updated

3 years ago
Whiteboard: [ready]
(Assignee)

Comment 28

3 years ago
Created attachment 8568521 [details] [diff] [review]
Part 1: Duplicate keyed histograms

Rebase.
Attachment #8564286 - Attachment is obsolete: true
Attachment #8568521 - Flags: review+
(Assignee)

Comment 29

3 years ago
https://hg.mozilla.org/integration/fx-team/rev/dc7c9541b697
https://hg.mozilla.org/integration/fx-team/rev/f4c7091564b7
(Assignee)

Comment 30

3 years ago
...and https://hg.mozilla.org/integration/fx-team/rev/1d11dc2e220e
https://hg.mozilla.org/mozilla-central/rev/dc7c9541b697
https://hg.mozilla.org/mozilla-central/rev/f4c7091564b7
https://hg.mozilla.org/mozilla-central/rev/1d11dc2e220e
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
status-firefox39: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
You need to log in before you can comment on or make changes to this bug.