Closed
Bug 1127914
Opened 10 years ago
Closed 10 years ago
Duplicate telemetry histograms for double submission
Categories
(Toolkit :: Telemetry, defect)
Toolkit
Telemetry
Tracking
()
RESOLVED
FIXED
mozilla39
Tracking | Status | |
---|---|---|
firefox39 | --- | fixed |
People
(Reporter: gfritzsche, Assigned: gfritzsche)
References
Details
(Whiteboard: [ready])
Attachments
(3 files, 8 obsolete files)
15.99 KB,
patch
|
gfritzsche
:
review+
|
Details | Diff | Splinter Review |
13.30 KB,
patch
|
gfritzsche
:
review+
|
Details | Diff | Splinter Review |
15.11 KB,
patch
|
gfritzsche
:
review+
|
Details | Diff | Splinter Review |
Per the discussion on the FHR/Telemetry unification concerns [0] we want to go with client-side submission of both "reset per subsession" and "old style telemetry" histograms.
This means that we have to duplicate histograms.
Bug 1126867 showed that this has acceptable memory costs.
0:"Suggestions for the new unified FHR/Telemetry/Experiment ping", https://mail.mozilla.org/pipermail/fhr-dev/2015-January/thread.html#382
Assignee | ||
Updated•10 years ago
|
Assignee: nobody → gfritzsche
Assignee | ||
Comment 1•10 years ago
|
||
This duplicates keyed histogram data and allows to reset and retrieve the duplicate separately.
Notes:
* I'm keeping this to the minimum required for handling duplicates in TelemetrySession.jsm
* I don't intend to document the duplicate features as this is only for a temporary telemetry-internal use-case
Attachment #8558148 -
Flags: review?(vdjeric)
Assignee | ||
Updated•10 years ago
|
Status: NEW → ASSIGNED
Assignee | ||
Comment 2•10 years ago
|
||
Fixed minor oversight.
Attachment #8558148 -
Attachment is obsolete: true
Attachment #8558148 -
Flags: review?(vdjeric)
Attachment #8558159 -
Flags: review?(vdjeric)
Assignee | ||
Comment 3•10 years ago
|
||
Attachment #8558646 -
Flags: review?(vdjeric)
Assignee | ||
Comment 4•10 years ago
|
||
This still has some test issues, but gives context on the usage of part 1 & 2.
Comment 5•10 years ago
|
||
Comment on attachment 8558646 [details] [diff] [review]
Part 2: Duplicate normal histograms
Review of attachment 8558646 [details] [diff] [review]:
-----------------------------------------------------------------
Given that we expect subsession histograms to eventually completely replace the full-session histograms, shouldn't we make the full-session histograms be the "duplicates"?
::: toolkit/components/telemetry/Telemetry.cpp
@@ +993,5 @@
> }
>
> +Histogram*
> +CloneHistogram(const nsACString& newName, Telemetry::ID existingId,
> + Histogram& existing)
- add a comment header for these functions explaining what they do & the motivation
@@ +995,5 @@
> +Histogram*
> +CloneHistogram(const nsACString& newName, Telemetry::ID existingId,
> + Histogram& existing)
> +{
> + const TelemetryHistogram &p = gHistograms[existingId];
- I'd prefer a check for nullness of gHistograms[existingId] here, rather than in this function's callers
- pick a more descriptive name than p
@@ +1009,5 @@
> + }
> +
> + Histogram::SampleSet ss;
> + existing.SnapshotSample(&ss);
> + clone->AddSampleSet(ss);
There's a chance of a race condition here, SnapshotSample() is not protected with any locks!
http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/histogram.cc#389
This could cause the sub-session histogram to miss samples added to the full-session histogram.
This wasn't a big deal before because HistogramFrom was only used to create snapshots of histograms at startup. It gets more sensitive for the new use case
@@ +1019,5 @@
> +CloneHistogram(const nsACString& newName, Telemetry::ID existingId)
> +{
> + Histogram *existing = nullptr;
> + nsresult rv = GetHistogramByEnumId(existingId, &existing);
> + if (NS_FAILED(rv)) {
why not put this check into the other CloneHistogram function? That way we'll know the check is done even if some future code calls the 3-parameter version of CloneHistogram directly
@@ +1027,5 @@
> + return CloneHistogram(newName, existingId, *existing);
> +}
> +
> +Histogram*
> +GetDuplicateHistogram(Histogram& existing)
same comment about the "duplicate" name
@@ +1057,5 @@
> +nsresult
> +HistogramAdd(Histogram& histogram, int32_t value)
> +{
> + histogram.Add(value);
> + if (Histogram* duplicate = GetDuplicateHistogram(histogram)) {
I feel like something very visible should happen if we're not able to create a duplicate histogram (e.g. an assert)
@@ +2194,5 @@
> + Histogram* original = h;
> + if (duplicates) {
> + h = GetDuplicateHistogram(*h);
> + if (!h) {
> + continue;
why would we want to continue here?
::: toolkit/components/telemetry/nsITelemetry.idl
@@ +58,5 @@
> /**
> + * As histogramSnapshots, except this contains the internally duplicated histograms for subsession telemetry.
> + */
> + [implicit_jscontext]
> + readonly attribute jsval duplicateHistogramSnapshots;
Duplicate is ambiguous, especially since it's not clear which is which. Let's use another name, maybe "subsessionHistogramSnapshots" or "dailyHistogramSnapshots".
::: toolkit/components/telemetry/tests/unit/test_nsITelemetry.js
@@ +604,5 @@
> Assert.ok(!registered.has("TELEMETRY_TEST_KEYED_FLAG"));
> Assert.ok(registered.has("TELEMETRY_TEST_KEYED_RELEASE_OPTOUT"));
> }
>
> +function test_duplicates() {
Add another test for calling clear() when there's data in both snapshots
@@ +630,5 @@
> + h.clear(true);
> + h.add(1);
> + snapshot = Telemetry.histogramSnapshots;
> + duplicates = Telemetry.duplicateHistogramSnapshots;
> + Assert.deepEqual(original[ID], duplicates[ID]);
this is a little hard to parse.. why not just check for value of the histogram's .sum field directly?
@@ +637,5 @@
> + // Increasing the count again means the duplicate should now match the second snapshot.
> + h.add(1);
> + snapshot = Telemetry.histogramSnapshots;
> + duplicates = Telemetry.duplicateHistogramSnapshots;
> + Assert.deepEqual(original2[ID], duplicates[ID]);
ditto
Attachment #8558646 -
Flags: review?(vdjeric)
Comment 6•10 years ago
|
||
Comment on attachment 8558159 [details] [diff] [review]
Part 1: Duplicate keyed histograms
Review of attachment 8558159 [details] [diff] [review]:
-----------------------------------------------------------------
::: toolkit/components/telemetry/Telemetry.cpp
@@ +3877,5 @@
> +nsresult
> +KeyedHistogram::Add(const nsCString& key, uint32_t sample)
> +{
> + if (!TelemetryImpl::CanRecord()) {
> + return NS_ERROR_FAILURE;
doesn't this mean we're going to report a JS error and throw an exception(?) if Telemetry is turned off?
@@ +3883,5 @@
> +
> + Histogram* histogram = GetHistogram(key, false);
> + Histogram* duplicate = GetHistogram(key, true);
> + if (!histogram || !duplicate) {
> + return NS_ERROR_FAILURE;
in general, i don't like throwing exceptions from Telemetry code.. they're nevercaught by the callers and they're hard to spot when it starts happening. Asserts (or segfaults for OOMs) would be preferable
Attachment #8558159 -
Flags: review?(vdjeric)
Assignee | ||
Comment 7•10 years ago
|
||
(In reply to Vladan Djeric (:vladan) from comment #5)
> Given that we expect subsession histograms to eventually completely replace
> the full-session histograms, shouldn't we make the full-session histograms
> be the "duplicates"?
I don't think that matters much, we have to remove one code-path either way and only submit/reset one of them.
> @@ +1009,5 @@
> > + }
> > +
> > + Histogram::SampleSet ss;
> > + existing.SnapshotSample(&ss);
> > + clone->AddSampleSet(ss);
>
> There's a chance of a race condition here, SnapshotSample() is not protected
> with any locks!
>
> http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/
> histogram.cc#389
>
> This could cause the sub-session histogram to miss samples added to the
> full-session histogram.
> This wasn't a big deal before because HistogramFrom was only used to create
> snapshots of histograms at startup. It gets more sensitive for the new use
> case
I vaguely remember that histograms and the telemetry API should only be used from the same thread. Is that not the case anymore?
I'm asking because the cloning i added is only triggered from the histogram accumulate/add code paths.
If we need this, where would you prefer to protect this? Synchronizing Telemetry call paths or on the histograms themselves?
> @@ +1019,5 @@
> > +CloneHistogram(const nsACString& newName, Telemetry::ID existingId)
> > +{
> > + Histogram *existing = nullptr;
> > + nsresult rv = GetHistogramByEnumId(existingId, &existing);
> > + if (NS_FAILED(rv)) {
>
> why not put this check into the other CloneHistogram function? That way
> we'll know the check is done even if some future code calls the 3-parameter
> version of CloneHistogram directly
I did not want to call GetHistogramByEnumId() unnecessarily - one caller / call path already has the histogram, so the lookup is not needed.
> @@ +2194,5 @@
> > + Histogram* original = h;
> > + if (duplicates) {
> > + h = GetDuplicateHistogram(*h);
> > + if (!h) {
> > + continue;
>
> why would we want to continue here?
Because keyed histograms follow a different code path and we don't get duplicates for the existing duplicate histograms - GetDuplicateHistogram() fails for them.
Assignee | ||
Comment 8•10 years ago
|
||
needinfo? on the open questions in comment 7.
Flags: needinfo?(vdjeric)
Assignee | ||
Comment 9•10 years ago
|
||
(In reply to Vladan Djeric (:vladan) from comment #5)
> @@ +1057,5 @@
> > +nsresult
> > +HistogramAdd(Histogram& histogram, int32_t value)
> > +{
> > + histogram.Add(value);
> > + if (Histogram* duplicate = GetDuplicateHistogram(histogram)) {
>
> I feel like something very visible should happen if we're not able to create
> a duplicate histogram (e.g. an assert)
For simplicity the additions are limited to registered histograms.
For all other histograms, GetDuplicateHistogram() (or now GetSubsessionHistogram()), returns null, which is not a failure.
We should handle failure cases in GetSubsessionHistogram().
Assignee | ||
Comment 10•10 years ago
|
||
Addressed feedback except the lock protection - i would handle that in a separate patch depending on the answer to the open question.
Attachment #8558159 -
Attachment is obsolete: true
Attachment #8562368 -
Flags: review?(vdjeric)
Assignee | ||
Comment 11•10 years ago
|
||
Addressed feedback except the lock protection - i would handle that in a separate patch depending on the answer to the open question.
Attachment #8558646 -
Attachment is obsolete: true
Attachment #8562373 -
Flags: review?(vdjeric)
Assignee | ||
Comment 12•10 years ago
|
||
This takes care of submitting the duplicates for all session reasons other than the "classic"/"full-session" ones.
Attachment #8559251 -
Attachment is obsolete: true
Attachment #8562426 -
Flags: review?(vdjeric)
Assignee | ||
Updated•10 years ago
|
Comment 13•10 years ago
|
||
> I vaguely remember that histograms and the telemetry API should only be used
> from the same thread. Is that not the case anymore?
That's not correct.
Some of the Telemetry.idl methods get called from other threads and there are some C++ methods exposed via Telemetry.h that access the same data from multiple threads (e.g. RecordSlowSQLStatement).
The histograms can be accumulated from a not-main thread (e.g. COMPOSITE_TIME) or even multiple threads (e.g. MOZ_SQLITE_PLACES_READ_MS). These accumulations are not synchronized!
The thinking is that Telemetry doesn't have to be 100% accurate because we only look at the results in aggregate, and we definitely don't want to pay the cost of synchronizing, we just need to make sure there's no data corruption.
> I'm asking because the cloning i added is only triggered from the histogram
> accumulate/add code paths.
Accumulate is not synchronized, so cloning could race with accumulate.
> If we need this, where would you prefer to protect this? Synchronizing
> Telemetry call paths or on the histograms themselves?
On second thought, I think this is just something we might have to tolerate. Let's document Telemetry behavior wrt threads in your in-tree doc.
> > @@ +1019,5 @@
> > > +CloneHistogram(const nsACString& newName, Telemetry::ID existingId)
> > > +{
> > > + Histogram *existing = nullptr;
> > > + nsresult rv = GetHistogramByEnumId(existingId, &existing);
> > > + if (NS_FAILED(rv)) {
> >
> > why not put this check into the other CloneHistogram function? That way
> > we'll know the check is done even if some future code calls the 3-parameter
> > version of CloneHistogram directly
>
> I did not want to call GetHistogramByEnumId() unnecessarily - one caller /
> call path already has the histogram, so the lookup is not needed.
You could only call GetHistogramByEnumId() in the 3-parameter version of CloneHistogram, but only if the ID is some known invalid value. But that would be messier cause the IDs are an auto-generated uint32_t enum. So nevermind :)
> > @@ +2194,5 @@
> > > + Histogram* original = h;
> > > + if (duplicates) {
> > > + h = GetDuplicateHistogram(*h);
> > > + if (!h) {
> > > + continue;
> >
> > why would we want to continue here?
>
> Because keyed histograms follow a different code path and we don't get
> duplicates for the existing duplicate histograms - GetDuplicateHistogram()
> fails for them.
Oh I see, GetDuplicateHistogram returns a nullptr if the histogram is keyed or if there's something really wrong with Telemetry (e.g. histogram ID not found).
Flags: needinfo?(vdjeric)
Comment 14•10 years ago
|
||
Comment on attachment 8562373 [details] [diff] [review]
Part 2: Duplicate normal histograms
Review of attachment 8562373 [details] [diff] [review]:
-----------------------------------------------------------------
::: toolkit/components/telemetry/Telemetry.cpp
@@ +993,5 @@
> }
>
> +// This clones a histogram |existing| with the id |existingId| to a
> +// new histogram with the name |newName|.
> +// For simplicity this is limited to registered histograms.
nit: use the /** .. */ style
@@ +1036,5 @@
> +Histogram*
> +GetSubsessionHistogram(Histogram& existing)
> +{
> + Telemetry::ID id;
> + nsresult rv = TelemetryImpl::GetHistogramEnumId(existing.histogram_name().c_str(), &id);
existing.id() has the histogram ID
@@ +1236,5 @@
> + bool onlySubsession = false;
> + JS::CallArgs args = JS::CallArgsFromVp(argc, vp);
> +
> + if (args.length() >= 1) {
> + if (!(args[0].isNumber() || args[0].isBoolean())) {
do we really want to allow number arguments here?
@@ +1248,4 @@
> Histogram *h = static_cast<Histogram*>(JS_GetPrivate(obj));
> + MOZ_ASSERT(h);
> + if(!onlySubsession) {
> + h->Clear();
it likely won't matter, but clear() is going to be racey too
Attachment #8562373 -
Flags: review?(vdjeric) → review+
Comment 15•10 years ago
|
||
Comment on attachment 8562368 [details] [diff] [review]
Part 1: Duplicate keyed histograms
Review of attachment 8562368 [details] [diff] [review]:
-----------------------------------------------------------------
::: toolkit/components/telemetry/tests/unit/test_nsITelemetry.js
@@ +610,5 @@
> + const KEY = "foo";
> +
> + // Both original and subsession should start out the same.
> + h.clear();
> + Assert.ok(!("foo" in h.snapshot()));
use KEY in these asserts instead of "foo"
Attachment #8562368 -
Flags: review?(vdjeric) → review+
Comment 16•10 years ago
|
||
Comment on attachment 8562426 [details] [diff] [review]
Part 3: Submit duplicate histograms for non-classic payload reasons
Review of attachment 8562426 [details] [diff] [review]:
-----------------------------------------------------------------
::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +792,5 @@
> * to |this.getSimpleMeasurements| and |this.getMetadata|,
> * respectively.
> */
> + assemblePayloadWithMeasurements: function(simpleMeasurements, info, reason) {
> + const classicReasons = [
this is pretty ugly, but I see how it allows you to make fewer changes.
@@ +807,5 @@
> let payloadObj = {
> ver: PAYLOAD_VERSION,
> simpleMeasurements: simpleMeasurements,
> + histograms: this.getHistograms(isSubsession),
> + keyedHistograms: this.getKeyedHistograms(isSubsession),
which code is going to call clear() on the subsession histograms?
::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
@@ +615,5 @@
> + // Both classic and subsession payload histograms should start the same.
> + count.clear();
> + keyed.clear();
> + let classic = TelemetrySession.getPayload();
> + let subsession = TelemetrySession.getPayload("environment-change");
might want to explicitly test getPayload("saved-session") == getPayload()
@@ +641,5 @@
> + Assert.ok(KEYED_ID in subsession.keyedHistograms);
> + Assert.equal(classic.keyedHistograms[KEYED_ID]["a"].sum, 1);
> + Assert.equal(classic.keyedHistograms[KEYED_ID]["b"].sum, 1);
> + Assert.equal(subsession.keyedHistograms[KEYED_ID]["a"].sum, 1);
> + Assert.equal(subsession.keyedHistograms[KEYED_ID]["b"].sum, 1);
Couldn't you just verify classic has the values you want, and then just Assert.deepEqual(classic, subsession)?
@@ +654,5 @@
> + Assert.ok(!(COUNT_ID in subsession.histograms));
> + Assert.ok(KEYED_ID in classic.keyedHistograms);
> + Assert.ok(KEYED_ID in subsession.keyedHistograms);
> + Assert.deepEqual(classic.keyedHistograms[KEYED_ID], {});
> + Assert.deepEqual(subsession.keyedHistograms[KEYED_ID], {});
ditto
Attachment #8562426 -
Flags: review?(vdjeric) → review+
Comment 17•10 years ago
|
||
There's a lot of changes here, so make sure to do an end-to-end test. You can use this dummy Telemetry server, instructions in README:
https://github.com/vdjeric/gzipServer
Comment 18•10 years ago
|
||
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #17)
> There's a lot of changes here, so make sure to do an end-to-end test. You
> can use this dummy Telemetry server, instructions in README:
>
> https://github.com/vdjeric/gzipServer
You can also run the real Telemetry server locally - instructions: https://github.com/mozilla/telemetry-server#httpserverjs
Assignee | ||
Comment 19•10 years ago
|
||
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #16)
> @@ +807,5 @@
> > let payloadObj = {
> > ver: PAYLOAD_VERSION,
> > simpleMeasurements: simpleMeasurements,
> > + histograms: this.getHistograms(isSubsession),
> > + keyedHistograms: this.getKeyedHistograms(isSubsession),
>
> which code is going to call clear() on the subsession histograms?
This will happen in bug 1120362, with the session split.
> ::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js
> @@ +615,5 @@
> > + // Both classic and subsession payload histograms should start the same.
> > + count.clear();
> > + keyed.clear();
> > + let classic = TelemetrySession.getPayload();
> > + let subsession = TelemetrySession.getPayload("environment-change");
>
> might want to explicitly test getPayload("saved-session") == getPayload()
Between collecting the two payloads, a number of histograms and other values may actually get updated (GC_REASON2, uptime, ...).
Also, we only duplicate registered histograms into the subsession.
We can't do a deep equality test, but i've added:
* verifying that we have the registered histograms in the subsession payload
* deep equality between subsession and classic for specific known-to-be-stable histograms
> @@ +641,5 @@
> > + Assert.ok(KEYED_ID in subsession.keyedHistograms);
> > + Assert.equal(classic.keyedHistograms[KEYED_ID]["a"].sum, 1);
> > + Assert.equal(classic.keyedHistograms[KEYED_ID]["b"].sum, 1);
> > + Assert.equal(subsession.keyedHistograms[KEYED_ID]["a"].sum, 1);
> > + Assert.equal(subsession.keyedHistograms[KEYED_ID]["b"].sum, 1);
>
> Couldn't you just verify classic has the values you want, and then just
> Assert.deepEqual(classic, subsession)?
See above.
Assignee | ||
Comment 20•10 years ago
|
||
Attachment #8562368 -
Attachment is obsolete: true
Attachment #8564286 -
Flags: review+
Assignee | ||
Comment 21•10 years ago
|
||
Assignee | ||
Updated•10 years ago
|
Attachment #8564287 -
Flags: review+
Assignee | ||
Updated•10 years ago
|
Attachment #8562373 -
Attachment is obsolete: true
Assignee | ||
Comment 22•10 years ago
|
||
Attachment #8562426 -
Attachment is obsolete: true
Attachment #8564288 -
Flags: review+
Comment 23•10 years ago
|
||
> > which code is going to call clear() on the subsession histograms?
>
> This will happen in bug 1120362, with the session split.
Ok, you'll want the histogram clear() to happen as close as possible to the snapshotting.. the more time elapses between capturing the snapshot and resetting the histograms, the more data gets lost. so for example you wouldn't want to have any yields from a Task in between the snapshot & clear
Comment 24•10 years ago
|
||
.. a snapshotAndClear method would be ideal
Assignee | ||
Comment 25•10 years ago
|
||
Good point, added moved that to bug 1120362.
Assignee | ||
Comment 26•10 years ago
|
||
Assignee | ||
Comment 27•10 years ago
|
||
Ok, the only failures seem to be hitting MOZ_ASSERT on Android:
03:26:17 INFO - 02-14 03:26:01.593 F/MOZ_Assert( 2258): Assertion failure: this->mRecursionLevel > 0, at /builds/slave/try-and-api-11-d-0000000000000/build/xpcom/glue/pldhash.cpp:567
https://tbpl.mozilla.org/php/getParsedLog.php?id=59037148&tree=Try&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=59037188&tree=Try&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=59037236&tree=Try&full=1
So, we should not hit the duplicate code on non-desktop anyway, i'll address that.
We should still understand the issue though in case this is not Android-specific.
Assignee | ||
Updated•10 years ago
|
Assignee | ||
Updated•10 years ago
|
Whiteboard: [ready]
Assignee | ||
Comment 28•10 years ago
|
||
Rebase.
Attachment #8564286 -
Attachment is obsolete: true
Attachment #8568521 -
Flags: review+
Assignee | ||
Comment 29•10 years ago
|
||
Assignee | ||
Comment 30•10 years ago
|
||
Comment 31•10 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/dc7c9541b697
https://hg.mozilla.org/mozilla-central/rev/f4c7091564b7
https://hg.mozilla.org/mozilla-central/rev/1d11dc2e220e
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
status-firefox39:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla39
You need to log in
before you can comment on or make changes to this bug.
Description
•