Closed Bug 1318297 Opened 7 years ago Closed 7 years ago

Implement a minimal health ping to monitor ping send failures

Categories

(Toolkit :: Telemetry, defect, P1)

defect
Points:
3

Tracking

()

RESOLVED FIXED
mozilla56
Tracking Status
firefox53 --- affected
firefox56 --- fixed

People

(Reporter: gfritzsche, Assigned: katejimmelon)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [measurement:client])

Attachments

(2 files, 24 obsolete files)

3.21 KB, patch
Details | Diff | Splinter Review
28.17 KB, patch
Details | Diff | Splinter Review
Bug 1302662 and other recent work (e.g. bug 1308040) increase the "main" ping payload significantly.
We need to be able to monitor when ping sending is not possible.
We could consider to send one very small status ping for each "main" ping that was:
* discarded for size
* got a rejecting http code back
* (?) timed out in submission

Each small status ping could contain:
* client id
* session id
* (?) subsession id
* a simple reason ("too big", "server rejected", ...)
What do you think about this?
Flags: needinfo?(mreid)
This seems reasonable to me. We need to be able to tell if there is a significant change in fidelity in the sending of the main ping.

Would it make sense to treat this as a use case of the error logging proposed in bug 1194907?
Flags: needinfo?(mreid)
I think the error logging is a much more general problem.
I would focus this specifically on very minimal Telemetry health data (to make sure this never runs into b/w issues or other problems).
We can send other, more general error monitoring solutions with the "main" ping or a new custom ping.
Summary: Make it possible to monitor "main" ping send failures → Add a minimal health ping to monitor "main" ping send failures
From bug 1318297, we should also account for payload assembly failures and other ping generation issues.
See e.g. TELEMETRY_ASSEMBLE_PAYLOAD_EXCEPTION for a failure path.
Per IRC, we might consider sending with HTTP (not HTTPS), to avoid cert failures etc.
We should check on any privacy implications for this though.
client_id in the clear? I can't say that I like the sound of that.

How frequently do we expect to send these?
Depends on: 1344235
Blocks: 1369049
Priority: P3 → P1
Assignee: nobody → kustiuzhanina
Blocks: 1372228
Summary: Add a minimal health ping to monitor "main" ping send failures → Implement a minimal health ping to monitor "main" ping send failures
Summary: Implement a minimal health ping to monitor "main" ping send failures → Implement a minimal health ping to monitor ping send failures
Blocks: 1374270
Depends on: 1373190
Attached patch health_ping_initial.patch (obsolete) — Splinter Review
Attachment #8879909 - Flags: feedback+
Attachment #8879909 - Flags: feedback+ → feedback?(gfritzsche)
Blocks: 1375008
Comment on attachment 8879909 [details] [diff] [review]
health_ping_initial.patch

Review of attachment 8879909 [details] [diff] [review]:
-----------------------------------------------------------------

This looks like a solid approach, thanks!

I've left mostly higher-level comments for now.
The only bigger things that stand out for me are:
- making the TelemetryHealthPing API more readable
- using a more stable clock

I assume we will break out the prioritized send of the health ping into a separate patch.
This seems rather independent, so we could even move that work to a separate bug.

::: toolkit/components/telemetry/TelemetryHealthPing.jsm
@@ +54,5 @@
> +
> +    return payload;
> +  },
> +
> +  submitPing() {

Can we always use this as `submitPing(reason)`?
Then we may not need to keep track of it in `this.reason`.

@@ +60,5 @@
> +    if ((now - this.lastSendTime) >= SEND_TICK_DELAY && this.hasDataToSend()) {
> +      this.log.trace("Submit health ping");
> +      let payload = this.assemblePayload();
> +      this.clearData();
> +      this.setLastSendTime(new Date());

One problem to call out here:
`Date` is not necessarily monotonic - it can jump forward or backward with the system clock.
We can use `Services.Telemetry.msSinceProcessStart()` instead, which is guaranteed to be monotonic.

@@ +62,5 @@
> +      let payload = this.assemblePayload();
> +      this.clearData();
> +      this.setLastSendTime(new Date());
> +
> +      TelemetrySend.resetTimeOutToDefault(); // TODO maybe not the right way. For tests

I assume this is dealing with the problem of testing timeout errors?

Have we tried yet solving this in tests by:
- registering a custom ping handler with the tests pingserver
- make the handler time out the first request
- reset the handler after (from within) the first request

That way the first request should time out, but later ones (like the health ping) should go through.

@@ +69,5 @@
> +        Services.tm.dispatchToMainThread(() => r(
> +          TelemetryController.submitExternalPing(this.HEALTH_PING_TYPE, payload, {addClientId: true}))));
> +    } else {
> +      this.reason = "delayed";
> +      Promise.resolve();

You still need to start the timer here?

@@ +76,5 @@
> +
> +  /**
> +   * @param failureTypes - array of type and subtype represented as array: [type, subtype]
> +   */
> +  add(failureTypes) {

I think it would be more clear to use something like:
> recordSendFailure(failureType)
> recordDiscardedPing(pingType)

I'm not sure if all future recorded information will fit the pattern (string -> (string -> number)).
I think its better to keep it simple for now (and not generic), then re-visit this once we have more use-cases.

@@ +116,5 @@
> +  isHealthPing(ping) {
> +    return ping.type === this.HEALTH_PING_TYPE;
> +  },
> +
> +  resetTimeOut() {

Is this test specific?
We could make this into a function `testReset()`, that resets/clears the state of this module.

::: toolkit/components/telemetry/tests/unit/test_TelemetryHealthPing.js
@@ +30,5 @@
> +
> +  let histSuccess = Telemetry.getHistogramById("TELEMETRY_SUCCESS");
> +  histSuccess.clear();
> +
> +  if (PingServer.started) {

We started the ping server already in `asyncSetup`, so i think we don't need this check.

::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
@@ +349,5 @@
> +  Assert.deepEqual(histSuccess.snapshot().counts, [0, 2, 0], "Should have recorded sending success.");
> +
> +  TelemetryStorage.loadArchivedPingList().then(loadedInfo => {
> +    let hCount = 0;
> +    for (let [id, info] of loadedInfo) {

This could be more readable as:
> let pings = await TelemetryStorage...
> let healthPings = pings.filter(() => ...);
Attachment #8879909 - Flags: feedback?(gfritzsche)
Comment on attachment 8879909 [details] [diff] [review]
health_ping_initial.patch

Review of attachment 8879909 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHealthPing.jsm
@@ +69,5 @@
> +        Services.tm.dispatchToMainThread(() => r(
> +          TelemetryController.submitExternalPing(this.HEALTH_PING_TYPE, payload, {addClientId: true}))));
> +    } else {
> +      this.reason = "delayed";
> +      Promise.resolve();

Are you speaking about Services.tm.dispatchToMainThread? If so, yes, I still need is. Without that health ping for oversized ping will be send before removing oversized ping which will lead to resending oversized bug and getting new health ping.

@@ +116,5 @@
> +  isHealthPing(ping) {
> +    return ping.type === this.HEALTH_PING_TYPE;
> +  },
> +
> +  resetTimeOut() {

Yes, this is only for tests.
(In reply to Kate Ustiuzhanina from comment #11)
> Comment on attachment 8879909 [details] [diff] [review]
> health_ping_initial.patch
> 
> Review of attachment 8879909 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/TelemetryHealthPing.jsm
> @@ +69,5 @@
> > +        Services.tm.dispatchToMainThread(() => r(
> > +          TelemetryController.submitExternalPing(this.HEALTH_PING_TYPE, payload, {addClientId: true}))));
> > +    } else {
> > +      this.reason = "delayed";
> > +      Promise.resolve();
> 
> Are you speaking about Services.tm.dispatchToMainThread? If so, yes, I still
> need is. Without that health ping for oversized ping will be send before
> removing oversized ping which will lead to resending oversized bug and
> getting new health ping.

I mean we need to start the timer for the 60min interval.
I think currently we would only send data immediately.
If we record an error and already sent data in the last 60min, we will not sent anything now (at least until the next failure is recorded).
Attachment #8879909 - Attachment is obsolete: true
(In reply to Georg Fritzsche [:gfritzsche] from comment #13)
> (In reply to Kate Ustiuzhanina from comment #11)
> > Comment on attachment 8879909 [details] [diff] [review]
> > health_ping_initial.patch
> > 
> > Review of attachment 8879909 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > ::: toolkit/components/telemetry/TelemetryHealthPing.jsm
> > @@ +69,5 @@
> > > +        Services.tm.dispatchToMainThread(() => r(
> > > +          TelemetryController.submitExternalPing(this.HEALTH_PING_TYPE, payload, {addClientId: true}))));
> > > +    } else {
> > > +      this.reason = "delayed";
> > > +      Promise.resolve();
> > 
> > Are you speaking about Services.tm.dispatchToMainThread? If so, yes, I still
> > need is. Without that health ping for oversized ping will be send before
> > removing oversized ping which will lead to resending oversized bug and
> > getting new health ping.
> 
> I mean we need to start the timer for the 60min interval.
> I think currently we would only send data immediately.
> If we record an error and already sent data in the last 60min, we will not
> sent anything now (at least until the next failure is recorded).

Actually, I thought it was the main idea to send health ping immediately if we can or collect data if we can't and send on next failure.
Comment on attachment 8880369 [details] [diff] [review]
AddUpdates_Bug_1318297_Support_initial_implementation_for_Health_ping.patch

Review of attachment 8880369 [details] [diff] [review]:
-----------------------------------------------------------------

This looks solid overall, nice.
The main open question i have high-level is how we will handle sending after the 60min delay.

Once that is implemented i can review more closely, looking into details.

::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
@@ +347,5 @@
> +  Assert.deepEqual(histSuccess.snapshot().counts, [0, 2, 0], "Should have recorded sending success.");
> +
> +  let pings = await TelemetryStorage.loadArchivedPingList();
> +  let healthPingsCount = Array.from(pings.values()).filter(p => TelemetryHealthPing.isHealthPing(p)).length;
> +  Assert.equal(healthPingsCount, 1, "Should have recorded 1 health ping");

Can we move this new code to after the `histSendTimeFail` check?

::: toolkit/components/telemetry/TelemetryHealthPing.jsm
@@ +71,5 @@
> +  recordDiscardedPing(pingType) {
> +    return this.add(this.PING_TOO_BIG, pingType);
> +  },
> +
> +  add(failureType, failureSubType) {

Maybe rename this to something more clear - maybe `addToFailure()`?

@@ +85,5 @@
> +    this.failures[failureType][failureSubType] = current + 1;
> +
> +    const now = Services.telemetry.msSinceProcessStart();
> +    if ((now - this.lastSendTime) >= SEND_TICK_DELAY) {
> +      return this.submitPing(dataNonEmpty ? "delayed" : "immediately");

I think this should always be `submitPing("immediately")`.
Then in the `else` branch below, you would start a timer.
When the timer is triggered we'd call `submitPing("delayed")`.

@@ +98,5 @@
> +
> +  clearData() {
> +    this.log.trace("Clear existing data");
> +    this.failures = {}; // Maybe not remove errors types but just set values to 0?
> +    this.reason = null;

This looks like a left-over?

@@ +110,5 @@
> +    return ping.type === this.HEALTH_PING_TYPE;
> +  },
> +
> +  testReset() {
> +    this.setLastSendTime(-SEND_TICK_DELAY);

Should this also call `clearData()`?

::: toolkit/components/telemetry/tests/unit/test_TelemetryHealthPing.js
@@ +142,5 @@
> +  Assert.equal(histogramValueCount(histFailure.snapshot()), 1,
> +    "Should have recorded 1 failed ping into histogram.");
> +});
> +
> +function checkHealthPingStructure(ping, expectedFailuresDict, expectedReason) {

Lets keep helper functions on the top (to aid readability).
Attachment #8880369 - Flags: feedback?(gfritzsche)
Attachment #8880369 - Attachment is obsolete: true
Attachment #8880445 - Flags: feedback?(gfritzsche)
Comment on attachment 8880445 [details] [diff] [review]
Update_Bug_1318297_Support_initial_implementation_for_Health_ping_1.patch

Review of attachment 8880445 [details] [diff] [review]:
-----------------------------------------------------------------

I think we need to improve the timeout handling a bit.

I also commented on some other things.

::: outreachy/src2/mozilla-central/toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
@@ -347,5 @@
>    Assert.equal(histSizeExceeded.snapshot().sum, 1, "Telemetry must report 1 oversized ping submitted.");
>    Assert.equal(histDiscardedSize.snapshot().counts[2], 1, "Telemetry must report a 2MB, oversized, ping submitted.");
> -  Assert.deepEqual(histSuccess.snapshot().counts, [0, 1, 0], "Should have recorded sending success.");
> -  Assert.equal(histogramValueCount(histSendTimeSuccess.snapshot()), 1, "Should have recorded send success time.");
> -  Assert.greater(histSendTimeSuccess.snapshot().sum, 0, "Should have recorded send success time.");

It looks like you dropped two checks here.
Is there a reason for this?

::: outreachy/src2/mozilla-central/toolkit/components/telemetry/TelemetryHealthPing.jsm
@@ +29,5 @@
> +
> +this.TelemetryHealthPing = {
> +  _logger: null,
> +
> +  lastSendTime: -SEND_TICK_DELAY,

Our convention is to prefix all the internal variables and functions with `_`.

@@ +92,5 @@
> +      return this.submitPing("immediately");
> +    } else {
> +      let submissionDelay = this.sendTickDelay - now - this.lastSendTime;
> +      return new Promise(function (resolve, reject) {
> +        setTimeout(TelemetryHealthPing.submitPing, submissionDelay, "delayed");

A few points here:

I think you don't need to wrap this in a promise:
> setTimeout(...);
> return Promise.resolve();

Will `submitPing()` actually be submitted with the right `this`?
You may need to `TelemetryHealthPing.submitPing.bind(this)` (or use an arrow function here).

How will we cancel this timer on shutdown?
And how will send remaining data on shutdown?
I think that we need to add a `shutdown()` function here and call it from `TelemetryController`:
https://dxr.mozilla.org/mozilla-central/rev/b1b9129838ade91684574f42219b2010928d7db4/toolkit/components/telemetry/TelemetryController.jsm#793

@@ +106,5 @@
> +    this.log.trace("Clear existing data");
> +    this.failures = {};
> +  },
> +
> +  setLastSendTime(timeInMs) {

Can we just directly set the property where needed instead of using this function?

@@ +110,5 @@
> +  setLastSendTime(timeInMs) {
> +    this.lastSendTime = timeInMs;
> +  },
> +
> +  isHealthPing(ping) {

This seems to be used in only one place.
Can we just directly test for the type there and remove this function?

::: outreachy/src2/mozilla-central/toolkit/components/telemetry/tests/unit/test_TelemetryHealthPing.js
@@ +72,5 @@
> +  // Set last submission time as now, so we need to collect data instead of send
> +  TelemetryHealthPing.setLastSendTime(now);
> +
> +  // Set small tick delay instead of one hour
> +  TelemetryHealthPing.setTickDelay(2000);

This makes the test less deterministic and also requires waiting.
A better approach would be to use our "Policy" pattern and override the timeout function TelemetryHealthPing uses, e.g.:
https://dxr.mozilla.org/mozilla-central/search?q=setSchedulerTickTimeout&redirect=false
https://dxr.mozilla.org/mozilla-central/search?q=fakeSchedulerTimer&redirect=false
https://dxr.mozilla.org/mozilla-central/rev/b1b9129838ade91684574f42219b2010928d7db4/toolkit/components/telemetry/TelemetrySend.jsm#122
Attachment #8880445 - Flags: feedback?(gfritzsche) → feedback+
Attachment #8880445 - Attachment is obsolete: true
Attachment #8881258 - Flags: feedback?(gfritzsche)
Comment on attachment 8881258 [details] [diff] [review]
Bug_1318297_Support_initial_implementation_for_Health_ping_.patch

Review of attachment 8881258 [details] [diff] [review]:
-----------------------------------------------------------------

This looks like an incremental patch, not a full one.
Can you update a full/squashed patch for review?

I left some more detailed comments already though on the code that i could see.

::: toolkit/components/telemetry/TelemetryHealthPing.jsm
@@ +41,2 @@
>  
> +  _timeoutId: null,

Nit: I don't think we need empty lines between these variable definitions.

@@ +53,5 @@
>    },
>  
> +  /**
> +   * Assemble payload from: failures, osInfo and reason.
> +   * @param reason - how data was submitted "immediately", "delayed", "shutdown"

How about: "@param {String} reason A string indicating the triggering reason (e.g. "immediately", "delayed")."
Ditto below.

@@ +54,5 @@
>  
> +  /**
> +   * Assemble payload from: failures, osInfo and reason.
> +   * @param reason - how data was submitted "immediately", "delayed", "shutdown"
> +   * @returns assembled payload

"@returns {Object} The assembled payload."

@@ +60,1 @@
>    assemblePayload(reason) {

Can we prefix the internal functions with "_"?

@@ +65,5 @@
>      return payload;
>    },
>  
> +  /**
> +   * Assemble and submit failures information.

"Assemble the failure information and submit it." ?

@@ +67,5 @@
>  
> +  /**
> +   * Assemble and submit failures information.
> +   * @param reason - how ping data was submitted "immediately", "delayed", "shutdown"
> +   * @returns {Promise} to submit assembled information.

How about: "... {Promise} Test-only promise that resolves when the ping was stored or sent (if any)."

@@ +75,5 @@
> +    if (this.hasDataToSend()) {
> +      let payload = this.assemblePayload(reason);
> +      this.clearData();
> +
> +      this._lastSendTime = Services.telemetry.msSinceProcessStart();

We should probably wrap `msSinceProcessStart()` like here:
https://dxr.mozilla.org/mozilla-central/rev/c01aa84ded7eb0b3e691f8bcc5cd887c960a779e/toolkit/components/telemetry/TelemetrySend.jsm#105

@@ +87,5 @@
>      }
>    },
>  
> +  /**
> +   * Call addToFailure function

This describes the "how", not the "what".
We can read the code for the technical details of the "how", the comment here can help us understand what the intent is.

How about e.g.: "Record a failure to send a ping out."

@@ +88,5 @@
>    },
>  
> +  /**
> +   * Call addToFailure function
> +   * @param failureType

"@param {String} failureType The type of failure (e.g. "timeout", ...)."

@@ +96,5 @@
>      return this.addToFailure(this.SEND_FAILURE, failureType);
>    },
>  
> +  /**
> +   * Call addToFailure function

This is also about the "how", not the "what".

@@ +106,5 @@
>    },
>  
> +  /**
> +   * Save failureType and failureSubType.
> +   * Call submitPing immediately or on timeout (if passed less that SEND_TICK_DELAY ms)

Similarly, lets focus on the "what" or "intent".
This accumulates failure information and may trigger a ping.

@@ +115,4 @@
>    addToFailure(failureType, failureSubType) {
>      this.log.trace("Add new failure with type and subtype: " + failureType, failureSubType);
>  
> +    if (!this._failures.hasOwnProperty(failureType)) {

Lets keep this simple: `if (!(failureType in this._failures)) ...`

@@ +132,5 @@
>      }
>    },
>  
> +  /**
> +   * @returns Check if _failures information not empty.

Similarly, lets not describe the code details.
The same goes for the following doc comments.

@@ +149,5 @@
>  
> +  /**
> +   * Clear timeout.
> +   * Calls in TelemetryController._cleanupOnShutdown.
> +   * @returns {*|Promise} to submit rest unsent information.

"@returns {Promise} Test-only, resolved when the ping is stored or sent."
Attachment #8881258 - Flags: feedback?(gfritzsche)
Attachment #8881258 - Attachment is obsolete: true
Attachment #8881393 - Flags: feedback?(gfritzsche)
(In reply to Georg Fritzsche [:gfritzsche] from comment #17)
> Comment on attachment 8880445 [details] [diff] [review]
> Update_Bug_1318297_Support_initial_implementation_for_Health_ping_1.patch
> 
> Review of attachment 8880445 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I think we need to improve the timeout handling a bit.
> 
> I also commented on some other things.
> 
> :::
> outreachy/src2/mozilla-central/toolkit/components/telemetry/tests/unit/
> test_TelemetrySend.js
> @@ -347,5 @@
> >    Assert.equal(histSizeExceeded.snapshot().sum, 1, "Telemetry must report 1 oversized ping submitted.");
> >    Assert.equal(histDiscardedSize.snapshot().counts[2], 1, "Telemetry must report a 2MB, oversized, ping submitted.");
> > -  Assert.deepEqual(histSuccess.snapshot().counts, [0, 1, 0], "Should have recorded sending success.");
> > -  Assert.equal(histogramValueCount(histSendTimeSuccess.snapshot()), 1, "Should have recorded send success time.");
> > -  Assert.greater(histSendTimeSuccess.snapshot().sum, 0, "Should have recorded send success time.");
> 
> It looks like you dropped two checks here.
> Is there a reason for this?
Instead of this lines I added check that we got ping with type health, I think it is enough and we don't need to check success histograms.
(In reply to Kate Ustiuzhanina from comment #21)
> > :::
> > outreachy/src2/mozilla-central/toolkit/components/telemetry/tests/unit/
> > test_TelemetrySend.js
> > @@ -347,5 @@
> > >    Assert.equal(histSizeExceeded.snapshot().sum, 1, "Telemetry must report 1 oversized ping submitted.");
> > >    Assert.equal(histDiscardedSize.snapshot().counts[2], 1, "Telemetry must report a 2MB, oversized, ping submitted.");
> > > -  Assert.deepEqual(histSuccess.snapshot().counts, [0, 1, 0], "Should have recorded sending success.");
> > > -  Assert.equal(histogramValueCount(histSendTimeSuccess.snapshot()), 1, "Should have recorded send success time.");
> > > -  Assert.greater(histSendTimeSuccess.snapshot().sum, 0, "Should have recorded send success time.");
> > 
> > It looks like you dropped two checks here.
> > Is there a reason for this?
> Instead of this lines I added check that we got ping with type health, I
> think it is enough and we don't need to check success histograms.

I think we should leave in the test coverage for the histograms as long as we have them.
This is testing specifically that those histograms keep working.
Comment on attachment 8881393 [details] [diff] [review]
Bug_1318297_Support_initial_implementation_for_Health_ping_.patch

Review of attachment 8881393 [details] [diff] [review]:
-----------------------------------------------------------------

This looks pretty clean now, thanks!

The only real concern i have now is: Why are we only recording failures for the "timeout" error?

I started to comment on some more details below. While i added a lot of comments, most of them are just small things like style comments etc.

I didn't completely review test_TelemetryHealthPing.js yet due to time constraints.

::: toolkit/components/telemetry/tests/unit/head.js
@@ +298,5 @@
>    });
>  }
>  
> +function histogramValueCount(histogramSnapshot) {
> +    return histogramSnapshot.counts.reduce((a, b) => a + b);

Nit: We use 2 spaces for indentation in JS files.

::: toolkit/components/telemetry/TelemetrySend.jsm
@@ +187,4 @@
>      return TelemetrySendImpl.pendingPingCount;
>    },
>  
> +  _pingSubmissionTimeout: PING_SUBMIT_TIMEOUT_MS,

Lets store this directly on `TelemetrySendImpl` next to the other property definitions.

@@ +1130,5 @@
>          failure = XHR_ERROR_TYPE[request.errorCode];
>        }
> +
> +      if (failure === "timeout") {
> +        TelemetryHealthPing.recordSendFailure(failure);

What about recording the other failure types?

::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
@@ -347,5 @@
>    Assert.equal(histSizeExceeded.snapshot().sum, 1, "Telemetry must report 1 oversized ping submitted.");
>    Assert.equal(histDiscardedSize.snapshot().counts[2], 1, "Telemetry must report a 2MB, oversized, ping submitted.");
> -  Assert.deepEqual(histSuccess.snapshot().counts, [0, 1, 0], "Should have recorded sending success.");
> -  Assert.equal(histogramValueCount(histSendTimeSuccess.snapshot()), 1, "Should have recorded send success time.");
> -  Assert.greater(histSendTimeSuccess.snapshot().sum, 0, "Should have recorded send success time.");

I don't think we should drop those.
Making sure these histograms work is separate of making sure the health ping works.

@@ +348,5 @@
>    Assert.equal(histogramValueCount(histSendTimeFail.snapshot()), 0, "Should not have recorded send failure time.");
> +
> +  let pings = await TelemetryStorage.loadArchivedPingList();
> +  let healthPingsCount = Array.from(pings.values()).filter(p => p.type === TelemetryHealthPing.HEALTH_PING_TYPE).length;
> +  Assert.equal(healthPingsCount, 1, "Should have recorded 1 health ping");

Can we test the health pings contents too, that the discarding was counted?

::: toolkit/components/telemetry/TelemetryController.jsm
@@ +794,5 @@
>  
>      // Now do an orderly shutdown.
>      try {
> +      // Send latest data.
> +      await TelemetryHealthPing.shutdown();

There still be some send activity after this point.

The safe point is to move this to after `TelemetrySend.shutdown()`.
This will also avoid triggering new sending activity and potentially delaying Firefox shutdown.
Moving this line to after TelemetrySend shutdown means that the health ping will just be stored to disk now and sent in the next session.
Bug 1374270 will improve that later.

::: toolkit/components/telemetry/TelemetryHealthPing.jsm
@@ +2,5 @@
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +/*
> + * This module is responsible for collecting and sending Helalth Ping and corresponded events.

"... sending the health ping."
How about adding a line like: "It collects data on send failures and other critical issues with Telemetry submissions."

@@ +13,5 @@
> +];
> +
> +const {classes: Cc, interfaces: Ci, utils: Cu, results: Cr} = Components;
> +
> +Cu.import("resource://gre/modules/TelemetryController.jsm", this);

Can we use lazy imports for all these modules? (i.e. XPCOMUtils.defineLazyModuleGetter()).
None of the modules have to be loaded for every Firefox session - if no failure is ever recorded, we may not need to import any of them.

@@ +24,5 @@
> +const Utils = TelemetryUtils;
> +
> +const MS_IN_A_MINUTE = 60 * 1000;
> +
> +const SEND_TICK_DELAY = 60 * MS_IN_A_MINUTE;

Can you add a comment that this describes what this is for?

@@ +34,5 @@
> +  setSchedulerTickTimeout: (callback, delayMs) => setTimeout(callback, delayMs),
> +  clearSchedulerTickTimeout: (id) => clearTimeout(id)
> +};
> +
> +this.TelemetryHealthPing = {

This module looks well-organized now, cheers!

@@ +49,5 @@
> +  PING_TOO_BIG: "pingDiscardedForSize",
> +  SEND_FAILURE: "sendFailure",
> +
> +  /**
> +   * Assemble payload from: _failures, _osInfo and reason.

I think this does not need to list the variables we assemble the data from.

@@ +66,5 @@
> +   * Assemble the failure information and submit it.
> +   * @param {String} reason A string indicating the triggering reason (e.g. "immediately", "delayed", "shutdown").
> +   * @returns {Promise} Test-only promise that resolves when the ping was stored or sent (if any).
> +   */
> +  submitPing(reason) {

I think this in an internal function?
Lets prefix it with "_".

@@ +67,5 @@
> +   * @param {String} reason A string indicating the triggering reason (e.g. "immediately", "delayed", "shutdown").
> +   * @returns {Promise} Test-only promise that resolves when the ping was stored or sent (if any).
> +   */
> +  submitPing(reason) {
> +    this._log.trace("Submit health ping ");

In other Telemetry modules we try to follow the patterns of: this._log.trace("functionName()")
or: this._log.warn("functionName() - some other information.")

Can you adopt the logging in this file to match this?

@@ +93,5 @@
> +    return this.addToFailure(this.SEND_FAILURE, failureType);
> +  },
> +
> +  /**
> +   * Record a discarded ping type to send a ping out.

How about: "Record that a ping was discarded and its type."

@@ +107,5 @@
> +   * @param {String} failureType The type of failure (e.g. "timeout", ...).
> +   * @param {String} failureSubType The subtype of failure (e.g. ping type, ...).
> +   * @returns {Promise} Test-only, resolved when the ping is stored or sent.
> +   */
> +  addToFailure(failureType, failureSubType) {

This is an internal function.

@@ +117,5 @@
> +
> +    let current = this._failures[failureType][failureSubType] || 0;
> +    this._failures[failureType][failureSubType] = current + 1;
> +
> +    const now = Services.telemetry.msSinceProcessStart();

Use `Utils.monotonicNow()`?

@@ +119,5 @@
> +    this._failures[failureType][failureSubType] = current + 1;
> +
> +    const now = Services.telemetry.msSinceProcessStart();
> +    if ((now - this._lastSendTime) >= SEND_TICK_DELAY) {
> +      return this.submitPing("immediately");

Nit: How about "immediate"?

@@ +130,5 @@
> +
> +  /**
> +   * @returns {boolean} Check the availability of recorded failures data.
> +   */
> +  hasDataToSend() {

This is an internal function.

@@ +137,5 @@
> +
> +  /**
> +   * Clear recorded failures data.
> +   */
> +  clearData() {

This is an internal function.

::: toolkit/components/telemetry/tests/unit/test_TelemetryHealthPing.js
@@ +10,5 @@
> +Cu.import("resource://gre/modules/TelemetryHealthPing.jsm", this);
> +Cu.import("resource://gre/modules/TelemetryStorage.jsm", this);
> +Cu.import("resource://gre/modules/Preferences.jsm", this);
> +
> +function checkHealthPingStructure(ping, expectedFailuresDict, expectedReason) {

If you make `expectedFailuresDict` the last parameter, you could use this like:
> checkHealthPingStructure(ping, "immediately", {
>   // ... expected failures.
> }

@@ +13,5 @@
> +
> +function checkHealthPingStructure(ping, expectedFailuresDict, expectedReason) {
> +  let payload = ping.payload;
> +
> +  Assert.equal(payload.reason, expectedReason, "Should have recorded expected reason");

Lets also test the `os` field (that its present, correct types, non-empty strings).

@@ +17,5 @@
> +  Assert.equal(payload.reason, expectedReason, "Should have recorded expected reason");
> +
> +  /**
> +   {
> +       failure: {

What is this comment for?
Is it a left-over or do you mean to document the structure?
If the latter, you could add some textual comment.

@@ +21,5 @@
> +       failure: {
> +           subtype: value,
> +       }
> +   }*/
> +  for (let key in expectedFailuresDict) {

You can use something like:
> for (let [failureKey, failure] of Object.entries(expectedFailuresDict)) ...

See: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/entries

@@ +22,5 @@
> +           subtype: value,
> +       }
> +   }*/
> +  for (let key in expectedFailuresDict) {
> +    Assert.ok(key in payload, "Failures should have recorded expected type of failure");

Can you add the expected type to the message?
Template strings can make this nicer: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/template_strings

@@ +24,5 @@
> +   }*/
> +  for (let key in expectedFailuresDict) {
> +    Assert.ok(key in payload, "Failures should have recorded expected type of failure");
> +
> +    for (let innerKey in expectedFailuresDict[key]) {

> for (let [subtypeKey, count] of Object.entries(failure)) ...

@@ +26,5 @@
> +    Assert.ok(key in payload, "Failures should have recorded expected type of failure");
> +
> +    for (let innerKey in expectedFailuresDict[key]) {
> +      Assert.ok(innerKey in payload[key],
> +        "Failures should have recorded expected subtype of failure");

Can you add the type and subtype to the message?

@@ +28,5 @@
> +    for (let innerKey in expectedFailuresDict[key]) {
> +      Assert.ok(innerKey in payload[key],
> +        "Failures should have recorded expected subtype of failure");
> +      Assert.equal(payload[key][innerKey], expectedFailuresDict[key][innerKey],
> +        "Should have recorded this type of failure")

Can add the type and subtype to the message?

@@ +42,5 @@
> +  Services.prefs.setBoolPref(PREF_TELEMETRY_ENABLED, true);
> +});
> +
> +add_task(async function asyncSetup() {
> +  await TelemetryController.testSetup();

We can actually merge `test_setup()` and `asyncSetup()`.
We just have those separate in other tests because for historical reasons (previously the first test function run couldn't be async).

@@ +53,5 @@
> +add_task(async function test_sendImmediately() {
> +  PingServer.clearRequests();
> +
> +  let histSuccess = Telemetry.getHistogramById("TELEMETRY_SUCCESS");
> +  histSuccess.clear();

Do we need to test the histograms here?
They seem unrelated to the health ping module thats being tested here.
Attachment #8881393 - Flags: feedback?(gfritzsche) → feedback+
Attachment #8881393 - Attachment is obsolete: true
Attachment #8882132 - Flags: feedback? → feedback?(gfritzsche)
Support choosing (with sorting) top 10 ping types.
Attachment #8882132 - Attachment is obsolete: true
Attachment #8882132 - Flags: feedback?(gfritzsche)
Attachment #8882559 - Flags: feedback?(gfritzsche)
Comment on attachment 8882559 [details] [diff] [review]
Bug_1318297_Support_initial_implementation_for_Health_ping_.patch

Review of attachment 8882559 [details] [diff] [review]:
-----------------------------------------------------------------

This looks pretty close.
I haven't reviewed the test yet this time; i can look over it later today.

::: toolkit/components/telemetry/TelemetryHealthPing.jsm
@@ +18,5 @@
> +
> +XPCOMUtils.defineLazyModuleGetter(this, "TelemetryController",
> +  "resource://gre/modules/TelemetryController.jsm");
> +
> +XPCOMUtils.defineLazyModuleGetter(

Nit: We don't need empty lines between the lazy imports here.

@@ +58,5 @@
> +this.TelemetryHealthPing = {
> +  Reason: Object.freeze({
> +    IMMEDIATE: "immediate",
> +    DELAYED: "delayed",
> +    SHUT_DOWN: "shutdown",

Can we comment on what each of these reasons mean?

@@ +69,5 @@
> +
> +  _logger: null,
> +  _lastSendTime: -SEND_TICK_DELAY,
> +  _failures: {},
> +  _timeoutId: null,

These 3 could use a comment explaining what they are for.
For `_failures` specifically lets point out the intended format (basically keyed scalars).

@@ +73,5 @@
> +  _timeoutId: null,
> +  _osInfo: {
> +    "name": Services.appinfo.OS,
> +    "version": Services.sysinfo.get("kernel_version") || Services.sysinfo.get("version")
> +  },

Can this be a constant too?

@@ +75,5 @@
> +    "name": Services.appinfo.OS,
> +    "version": Services.sysinfo.get("kernel_version") || Services.sysinfo.get("version")
> +  },
> +
> +  HEALTH_PING_TYPE: "health",

Nit: Lets move this constant to before the non-constant properties (`_logger` etc.).

@@ +97,5 @@
> +  },
> +
> +  /**
> +   * Assemble payload.
> +   * @param {String} reason A string indicating the triggering reason (e.g. "immediately", "delayed", "shutdown").

Nit: "immediate"

@@ +106,5 @@
> +    let payload = {};
> +    payload.os = this._osInfo;
> +    payload.reason = reason;
> +
> +    Object.keys(this._failures).forEach(key => {

for (let key of Object.keys(...)) ...

@@ +122,5 @@
> +   * Sort input dictionary descending by value.
> +   * @param {Object} failures A dictionary of failures subtype and count.
> +   * @returns {Object} Sorted failures by value.
> +   */
> +  _sortFailuresWithValue(failures) {

This doesn't just sort and is specific to the discarded pings.
Maybe just call it "_getTopDiscardFailures()" or so?

@@ +130,5 @@
> +    });
> +
> +    let result = {};
> +    sortedItems.slice(0, MAX_SEND_DISCARDED_PINGS).forEach(key => {
> +      result[key[0]] = key[1];

.forEach(([key, value]) => ...)

@@ +142,5 @@
> +   * @param {String} reason A string indicating the triggering reason (e.g. "immediately", "delayed", "shutdown").
> +   * @returns {Promise} Test-only promise that resolves when the ping was stored or sent (if any).
> +   */
> +  _submitPing(reason) {
> +    this._log.trace("_submitPing()");

Should we also log `reason` here?

@@ +150,5 @@
> +
> +      this._lastSendTime = Utils.monotonicNow();
> +
> +      return new Promise(r =>
> +        Services.tm.dispatchToMainThread(() => r(

Lets put a comment here on why we use `dispatchToMainThread()`.

@@ +207,5 @@
> +   */
> +  shutdown() {
> +    this._log.trace("shutdown()");
> +    if (this._timeoutId) {
> +      Policy.clearSchedulerTickTimeout(this._timeoutId);

Also reset `this._timeoutId`.

@@ +219,5 @@
> +  testReset() {
> +    this._lastSendTime = -SEND_TICK_DELAY;
> +    this._clearData();
> +    if (this._timeoutId) {
> +      Policy.clearSchedulerTickTimeout(this._timeoutId);

Also reset `this._timeoutId`.
Attachment #8882559 - Flags: feedback?(gfritzsche) → feedback+
Attachment #8882559 - Attachment is obsolete: true
Attachment #8883023 - Flags: feedback?(gfritzsche)
Comment on attachment 8883023 [details] [diff] [review]
Bug_1318297_Support_initial_implementation_for_Health_ping_.patch

Review of attachment 8883023 [details] [diff] [review]:
-----------------------------------------------------------------

I looked only through the tests this time.
The test cases cover the relevant scenarios, but there are some details that can be improved (notably how this overrides the timeouts & using internal functions).

A side-note: The commit message of the patch looks not very useful (you can e.g. see it in the raw patch or by following the "splinter review" link).

::: toolkit/components/telemetry/tests/unit/test_TelemetryHealthPing.js
@@ +17,5 @@
> +function checkHealthPingStructure(ping, expectedReason, expectedFailuresDict) {
> +  let payload = ping.payload;
> +
> +  Assert.equal(payload.reason, expectedReason, "Should have recorded expected reason");
> +  Assert.deepEqual(payload.os, TelemetryHealthPing.OsInfo, "Should have recorded correct os info");

Lets also check that the payload contains exactly the keys "reason", "os" and whatever is in `expectedFailuresDict`?
That tests that we don't record any additional properties.

@@ +21,5 @@
> +  Assert.deepEqual(payload.os, TelemetryHealthPing.OsInfo, "Should have recorded correct os info");
> +
> +  for (let [failureKey, failure] of Object.entries(expectedFailuresDict)) {
> +    Assert.deepEqual(payload[failureKey], failure,
> +      "Should have recorded failure with type: " + failureKey + " and subtype: " + Object.keys(failure)[0]);

Side-note, optional: If you want, you could also use template strings here:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals

@@ +43,5 @@
> +add_task(async function test_sendImmediately() {
> +  PingServer.clearRequests();
> +  TelemetryHealthPing.testReset();
> +
> +  await TelemetryHealthPing._addToFailure("testProblem", "testPingType");

We should not use internal functions unless absolutely necessary for test purposes.
You could use the .record* functions or just trigger the problems in TelemetrySend instead.

@@ +57,5 @@
> +  PingServer.clearRequests();
> +  TelemetryHealthPing.testReset();
> +
> +  // Set last submission time as now, so we need to collect data instead of send
> +  TelemetryHealthPing._lastSendTime = Utils.monotonicNow();

You can do this without accessing any internal properties by triggering a health ping first.

@@ +61,5 @@
> +  TelemetryHealthPing._lastSendTime = Utils.monotonicNow();
> +
> +  let PING_TYPE = "delayed-ping";
> +
> +  // Send first oversized ping, health ping must not be send now.

"Send oversized pings first, ..."

@@ +70,5 @@
> +  // Set up small timeout and fake time out handler.
> +  let telemetryHealthPing = Cu.import("resource://gre/modules/TelemetryHealthPing.jsm", {});
> +  let timeout =
> +    telemetryHealthPing.Policy.setSchedulerTickTimeout(
> +      () => TelemetryHealthPing._submitPing(TelemetryHealthPing.Reason.DELAYED), 2000

You can use this more realistically by overwriting the scheduler timeout like here:
https://dxr.mozilla.org/mozilla-central/rev/6f8f10f48ace5692256efd91f011bd23054ee2ec/toolkit/components/telemetry/tests/unit/head.js#191
https://dxr.mozilla.org/mozilla-central/rev/6f8f10f48ace5692256efd91f011bd23054ee2ec/toolkit/components/telemetry/tests/unit/test_SubsessionChaining.js#141

Then use something like:
let callback = null;
fakeHealthTimer((cb) => callback = cb);
// ... now trigger TelemetrySend failure.
// ... then trigger the timer and wait for the ping to go out:
await callback();

@@ +86,5 @@
> +
> +  telemetryHealthPing.Policy.clearSchedulerTickTimeout(timeout);
> +});
> +
> +add_task(async function test_sendOverSizedPing() {

test_sendOnDelay() is testing with over-sized pings too.
We should probably move this one before it then (or remove it)?

@@ +119,5 @@
> +  PingServer.registerPingHandler((request, result) => {
> +    PingServer.resetPingHandler();
> +    TelemetrySend.resetTimeOutToDefault();
> +    result.processAsync();
> +    result.finish();

Does this reliably time out if you already call `result.finish()` here?
I would assume this is dependent on timing.

A safer way would be to:
- not call result.finish() here
- wait for the health ping to be received
- then .finish() (if that is still needed?)

@@ +160,5 @@
> +  let telemetryHealthPing = Cu.import("resource://gre/modules/TelemetryHealthPing.jsm", {});
> +
> +  let timeout =
> +    telemetryHealthPing.Policy.setSchedulerTickTimeout(
> +      () => TelemetryHealthPing._submitPing(TelemetryHealthPing.Reason.DELAYED), 2000

This should also override like above for better coverage.

@@ +168,5 @@
> +  Assert.equal(ping.type, TelemetryHealthPing.HEALTH_PING_TYPE, "Should have recorded a health ping.");
> +
> +  let expectedPayload = {};
> +  expectedPayload[TelemetryHealthPing.FailureType.DISCARDED_FOR_SIZE] =
> +    TelemetryHealthPing._getTopDiscardFailures(discardedForSizeFailures);

Lets not use internal functions here. We can define how the expected payload should look like in the test.
Attachment #8883023 - Flags: feedback?(gfritzsche)
Attachment #8883023 - Attachment is obsolete: true
Attachment #8883328 - Flags: feedback?(gfritzsche)
Attached patch healthPing.patch (obsolete) — Splinter Review
Update patch format
Attachment #8883328 - Attachment is obsolete: true
Attachment #8883328 - Flags: feedback?(gfritzsche)
Comment on attachment 8883328 [details] [diff] [review]
Bug_1318297_Support_initial_implementation_for_Health_ping__Health_ping_is_supposed_to_be_.patch

Review of attachment 8883328 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, this is pretty much done.
Next steps:
- There are some small style comments below etc. that should be fixed.
- I want to take the patch for a quick test run (with the next patch upload that i can apply).
- We need to do a minimal budget estimate before landing this (beware of not sharing release user numbers).
- We need documentation for the ping so we can ask for data review (this is probably easiest to upload in a separate patch on this bug).

::: toolkit/components/telemetry/TelemetrySend.jsm
@@ +187,4 @@
>      return TelemetrySendImpl.pendingPingCount;
>    },
>  
> +  setTimeoutForPingSubmit(timeoutInMS) {

Sorry for calling this out late, but...
We already use a `Policy` object in TelemetrySend.jsm, could we use that for overriding the timeout?

We can do this in a follow-up bug though, so:
- please file a bug for it
- lets name the functions here "testSetTimeout..." and "testReset..." to make clear they are for testing only.

@@ +1190,5 @@
>                 .add(Math.floor(pingSizeBytes / 1024 / 1024));
>        // We don't need to call |request.abort()| as it was not sent yet.
>        this._pendingPingRequests.delete(id);
> +
> +      TelemetryHealthPing.recordDiscardedPing(ping.type);

Lets file a bug (blocking bug 1372228) to consider recording from the other location where we discard pings:
https://dxr.mozilla.org/mozilla-central/search?q=TELEMETRY_PING_SIZE_EXCEEDED_PENDING+path%3ATelemetryStorage.jsm&redirect=false

::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
@@ +351,4 @@
>    Assert.greater(histSendTimeSuccess.snapshot().sum, 0, "Should have recorded send success time.");
>    Assert.equal(histogramValueCount(histSendTimeFail.snapshot()), 0, "Should not have recorded send failure time.");
> +
> +  Assert.equal(ping.type, TelemetryHealthPing.HEALTH_PING_TYPE, "Should have recorded 1 over sized ping.");

"Should have received a health ping."

::: toolkit/components/telemetry/TelemetryHealthPing.jsm
@@ +63,5 @@
> +
> +  _logger: null,
> +
> +  // Health ping sends every SEND_TICK_DELAY.
> +  // Set up last send time to be able to send first failures immediately.

Suggestions:
"The health ping is send every ..."
"Initialize this so that ..."

@@ +67,5 @@
> +  // Set up last send time to be able to send first failures immediately.
> +  _lastSendTime: -SEND_TICK_DELAY,
> +
> +  /**
> +   * This module collect failures while can't send them. Failures structure:

How about: "This stores reported send failures with the following structure:"

@@ +77,5 @@
> +   *  },
> +   *  ...
> +   * }
> +   */
> +   _failures: {},

Nit: This is indented one space too far.

@@ +107,5 @@
> +  _assemblePayload(reason) {
> +    this._log.trace("_assemblePayload()");
> +    let payload = {};
> +    payload.os = this.OsInfo;
> +    payload.reason = reason;

A small nit:
let payload = {
  os: ...
  reason: ...
};

@@ +141,5 @@
> +  },
> +
> +  /**
> +   * Assemble the failure information and submit it.
> +   * @param {String} reason A string indicating the triggering reason (e.g. "immediately", "delayed", "shutdown").

Nit: "immediate"

@@ +154,5 @@
> +      this._lastSendTime = Utils.monotonicNow();
> +
> +      return new Promise(r =>
> +        // Send health ping on the next tick in order to wait for removing discarded pings (As Health ping collection
> +        // instrumented before promise in TelemetrySend was resolved).

What exactly was the problem if we don't do that? Can we call this out?

@@ +226,5 @@
> +    return this._submitPing(this.Reason.SHUT_DOWN);
> +  },
> +
> +  /**
> +   * Test-only, restore state to initial.

Small nit: I believe that is "restore to initial state."?

::: toolkit/components/telemetry/tests/unit/test_TelemetryHealthPing.js
@@ +50,5 @@
> +  await TelemetryHealthPing.recordSendFailure("testProblem");
> +  let ping = await PingServer.promiseNextPing();
> +  checkHealthPingStructure(ping, {
> +    [TelemetryHealthPing.FailureType.SEND_FAILURE]: {
> +      "testProblem": 1

These tests are very clearly structured and easy to read!

@@ +61,5 @@
> +add_task(async function test_sendOnDelay() {
> +  PingServer.clearRequests();
> +  TelemetryHealthPing.testReset();
> +
> +  // Submit health ping now to update TelemetryHealthPing._lastSendTime

"This first failure should immediately trigger a ping. After this, subsequent failures should be throttled." ?

@@ +62,5 @@
> +  PingServer.clearRequests();
> +  TelemetryHealthPing.testReset();
> +
> +  // Submit health ping now to update TelemetryHealthPing._lastSendTime
> +  await TelemetryHealthPing.recordSendFailure("testFailure");

We can `await PingServer.promiseNextPing()` here.
Then we can below wait for exactly one ping too.

@@ +140,5 @@
> +  PingServer.clearRequests();
> +  let PING_TYPE = "sort-discarded";
> +
> +  // Submit health ping now to update TelemetryHealthPing._lastSendTime
> +  await TelemetryHealthPing.recordSendFailure("testFailure");

Same here, `await ...promiseNextPing()` here makes the expected behavior more clear.
Attachment #8883328 - Attachment is obsolete: false
Attachment #8883328 - Attachment is obsolete: true
(In reply to Georg Fritzsche [:gfritzsche] from comment #31)
> @@ +154,5 @@
> > +      this._lastSendTime = Utils.monotonicNow();
> > +
> > +      return new Promise(r =>
> > +        // Send health ping on the next tick in order to wait for removing discarded pings (As Health ping collection
> > +        // instrumented before promise in TelemetrySend was resolved).
> 
> What exactly was the problem if we don't do that? Can we call this out?

No, we can't. The problem is: when sombody submits oversized ping, then Health ping sending is triggered, but when TelemetryHealthPing module sumbits health ping, oversied ping will not be removed to that time and TelemetrySend will try to resend it again, which lead to data duplication.
Sending Health ping on the next tick we at first wait for removing oversized ping and than could easily send health ping.
Attached patch healthPing.patch (obsolete) — Splinter Review
Attachment #8883917 - Attachment is obsolete: true
Attachment #8884280 - Flags: feedback?(gfritzsche)
Comment on attachment 8884280 [details] [diff] [review]
healthPing.patch

Review of attachment 8884280 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, this is basically done.
There are just a last few small things to be done.

Last steps for this patch:
- fix the few last nits
- fix the eslint errors (./mach eslint toolkit/components/telemetry)
- fix test failures (./mach xpcshell-test toolkit/components/telemetry/tests/unit)
- check that it succeeds on try (this is basically a "test run" for our CI, i can push it there)

Before we can land anything on this bug, we need to:
- have a patch for documentation
- get data collection review on it: https://wiki.mozilla.org/Firefox/Data_Collection

::: toolkit/components/telemetry/TelemetryHealthPing.jsm
@@ +157,5 @@
> +      return new Promise(r =>
> +        // When somebody submits oversized ping Health ping sending is triggered, but if health ping was submitted
> +        // immediately (not on next tick), oversied ping will not be removed and will be recorded twice.
> +        // This allows sending health ping on the next tick in order to wait for removing discarded pings before
> +        // submitting Health ping.

How about (correct me if i got this wrong):
"If we submit the health ping immediately, the send task would be triggered again before discarding oversized pings from the queue. To work around this, we send the ping on the next tick."

::: toolkit/components/telemetry/tests/unit/test_TelemetryHealthPing.js
@@ +22,5 @@
> +    Assert.deepEqual(payload[key], value, "Should have recorded correct entry with key: " + key);
> +  }
> +}
> +
> +function fakeSchedulerTimer(set, clear) {

This conflicts with `fakeSchedulerTimer()` in head.js.
Rename this to e.g. `fakeHealthSchedulerTimer()`?

@@ +28,5 @@
> +  telemetryHealthPing.Policy.setSchedulerTickTimeout = set;
> +  telemetryHealthPing.Policy.clearSchedulerTickTimeout = clear;
> +}
> +
> +add_task(async function asyncSetup() {

Small nit: Lets just call the function `setup()`?

@@ +63,5 @@
> +  TelemetryHealthPing.testReset();
> +
> +  // This first failure should immediately trigger a ping. After this, subsequent failures should be throttled.
> +  await TelemetryHealthPing.recordSendFailure("testFailure");
> +  await PingServer.promiseNextPing();

Lets check for the health ping type here.
This will make it easier to diagnose test failures in case we receive unexpected pings etc.

@@ +91,5 @@
> +  TelemetryHealthPing.testReset();
> +  PingServer.clearRequests();
> +  let OVER_SIZED_PING_TYPE = "over-sized-ping";
> +
> +  let overSizedData = generateRandomString(2 * 1024 * 1024);

Nit: Unneeded new-line before this?

@@ +113,5 @@
> +
> +  // Set up small ping submission timeout to always have timeout error.
> +  TelemetrySend.testSetTimeoutForPingSubmit(2);
> +
> +  // This ping handler works only for first ping and reset timeout to be able to send health ping.

"Reset the timeout after receiving the first ping to ..." ?

@@ +120,5 @@
> +    TelemetrySend.testResetTimeOutToDefault();
> +  });
> +
> +  await TelemetryController.submitExternalPing(PING_TYPE, {});
> +

Style nit: unneeded new-line? Single lines don't make good "paragraphs" for reading.

::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
@@ +330,4 @@
>    // Submit a ping of a normal size and check that we don't count it in the histogram.
>    await TelemetryController.submitExternalPing(TEST_PING_TYPE, { test: "test" });
>    await TelemetrySend.testWaitOnOutgoingPings();
> +  await PingServer.promiseNextPing();

We can actually drop the `testWaitOnOutgoingPings()` line now.
`promiseNextPing()` is sufficient to wait for the ping here.

@@ +351,5 @@
>    Assert.greater(histSendTimeSuccess.snapshot().sum, 0, "Should have recorded send success time.");
>    Assert.equal(histogramValueCount(histSendTimeFail.snapshot()), 0, "Should not have recorded send failure time.");
> +
> +  Assert.equal(ping.type, TelemetryHealthPing.HEALTH_PING_TYPE, "Should have received a health ping.");
> +  Assert.deepEqual(ping.payload[TelemetryHealthPing.FailureType.DISCARDED_FOR_SIZE], { [TEST_PING_TYPE]: 1 },

Nit: This line is pretty long. Just put the second argument on a new line?
Attachment #8884280 - Flags: feedback?(gfritzsche) → feedback+
::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
@@ +330,4 @@
>    // Submit a ping of a normal size and check that we don't count it in the histogram.
>    await TelemetryController.submitExternalPing(TEST_PING_TYPE, { test: "test" });
>    await TelemetrySend.testWaitOnOutgoingPings();
> +  await PingServer.promiseNextPing();

We can actually drop the `testWaitOnOutgoingPings()` line now.
`promiseNextPing()` is sufficient to wait for the ping here.
It seems that I can't remove this line. If I remove it, I get error on this line:
Assert.deepEqual(histSuccess.snapshot().counts, [0, 1, 0], "Should have recorded sending success.");
(In reply to Kate Ustiuzhanina from comment #35)
> ::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
> @@ +330,4 @@
> >    // Submit a ping of a normal size and check that we don't count it in the histogram.
> >    await TelemetryController.submitExternalPing(TEST_PING_TYPE, { test: "test" });
> >    await TelemetrySend.testWaitOnOutgoingPings();
> > +  await PingServer.promiseNextPing();
> 
> We can actually drop the `testWaitOnOutgoingPings()` line now.
> `promiseNextPing()` is sufficient to wait for the ping here.
> It seems that I can't remove this line. If I remove it, I get error on this
> line:
> Assert.deepEqual(histSuccess.snapshot().counts, [0, 1, 0], "Should have
> recorded sending success.");

Ok, interesting, lets just leave it then for now.
Attached patch healthPing.patch (obsolete) — Splinter Review
Attachment #8884280 - Attachment is obsolete: true
Attachment #8885241 - Flags: review?(gfritzsche)
Comment on attachment 8885241 [details] [diff] [review]
healthPing.patch

Review of attachment 8885241 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good, thanks!

Two small nits below.
Also, two nits on the commit message:
- "Bug 1318297 - Implement Telemetry health ping." ?
- Add an empty line after the first (that is the standard for separating short description and longer text).

::: toolkit/components/telemetry/tests/unit/test_PingAPI.js
@@ +121,4 @@
>    await TelemetryController.testReset();
>  
>    let pingList = await TelemetryArchive.promiseArchivedPingList();
> +  pingList = pingList.filter(ping => ping.type !== "health");

This should be more robust:
.filter(ping => ping.type.startsWith("test-ping-api-"))

@@ +178,4 @@
>    // Check that we are still picking up the valid archived pings on disk,
>    // plus the valid ones above.
>    pingList = await TelemetryArchive.promiseArchivedPingList();
> +  pingList = pingList.filter(ping => ping.type !== "health");

.filter(ping => ping.type == "foo")
Attachment #8885241 - Flags: review?(gfritzsche) → review+
Attached patch healthPing.patch (obsolete) — Splinter Review
Attachment #8885241 - Attachment is obsolete: true
Attachment #8885278 - Flags: review?(gfritzsche)
Attached patch healthPing.patch (obsolete) — Splinter Review
Attachment #8885278 - Attachment is obsolete: true
Attachment #8885278 - Flags: review?(gfritzsche)
Attachment #8885355 - Flags: review?(gfritzsche)
Comment on attachment 8885355 [details] [diff] [review]
healthPing.patch

Review of attachment 8885355 [details] [diff] [review]:
-----------------------------------------------------------------

This applies and runs most tests fine now for me locally.

However, test_PingAPI.js is failing with:
> 0:02.78 TEST_STATUS: Thread-1 test_archiveCleanup FAIL [test_archiveCleanup : 260] All the pings must be live-accumulated in the histogram. - 23 == 22
Attachment #8885355 - Flags: review?(gfritzsche)
Points: --- → 3
Attached patch healthPing.patch (obsolete) — Splinter Review
Attachment #8885355 - Attachment is obsolete: true
Attachment #8885694 - Flags: review?(gfritzsche)
Comment on attachment 8885694 [details] [diff] [review]
healthPing.patch

Review of attachment 8885694 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetryHealthPing.jsm
@@ +148,5 @@
> +   * @param {String} reason A string indicating the triggering reason (e.g. "immediate", "delayed", "shutdown").
> +   * @returns {Promise} Test-only promise that resolves when the ping was stored or sent (if any).
> +   */
> +  _submitPing(reason) {
> +    if (IS_HEALTH_PING_ENABLED) {

This is getting indented a little far.
Can we use early returns here?

> this._log.trace(...);
> if (!ENABLED || !hasDataToSend()) {
>   return;
> ...

::: toolkit/components/telemetry/docs/internals/preferences.rst
@@ +137,5 @@
>  ``toolkit.telemetry.send.overrideOfficialCheck``
>  
>    If true, allows sending pings on unofficial builds. Requires a restart.
> +
> +``toolkit.telemetry.send.overrideOfficialCheck``

``...healthping.enabled``

@@ +139,5 @@
>    If true, allows sending pings on unofficial builds. Requires a restart.
> +
> +``toolkit.telemetry.send.overrideOfficialCheck``
> +
> +  If true, allows sending health pings via ``TelemetryHealthPing.jsm``.

"If false, sending health pings is disabled. Defaults to true."

::: toolkit/components/telemetry/tests/unit/test_PingAPI.js
@@ +126,4 @@
>  
>    let pingList = await TelemetryArchive.promiseArchivedPingList();
>    Assert.deepEqual(expectedPingList, pingList,
> +    "Should have submitted pings in archive list after restart");

I don't think we need to touch this line in this patch.
Attachment #8885694 - Flags: review?(gfritzsche) → feedback+
Also, for the commit message, can you still add the empty line after the first line?
Attached patch healthPing.patch (obsolete) — Splinter Review
Attachment #8885694 - Attachment is obsolete: true
Attachment #8885735 - Flags: review?(gfritzsche)
Comment on attachment 8885735 [details] [diff] [review]
healthPing.patch

Review of attachment 8885735 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good, thanks!

Next up:
- documentation
- getting data review (based on the documentation)
- signing off on the storage estimate
Attachment #8885735 - Flags: review?(gfritzsche) → review+
(In reply to Georg Fritzsche [:gfritzsche] from comment #46)
> Comment on attachment 8885735 [details] [diff] [review]
> healthPing.patch
> 
> Review of attachment 8885735 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks good, thanks!
> 
> Next up:
> - documentation
> - getting data review (based on the documentation)
> - signing off on the storage estimate

Also last minor thing (doesn't need review from me):
Change the first line of the commit message to match the format in comment 38.
Attached patch healthPing.patch (obsolete) — Splinter Review
Attachment #8885735 - Attachment is obsolete: true
Attached patch docsHealthPing.patch (obsolete) — Splinter Review
Attachment #8886140 - Flags: review?(gfritzsche)
Attached patch docsHealthPing.patch (obsolete) — Splinter Review
Attachment #8886140 - Attachment is obsolete: true
Attachment #8886140 - Flags: review?(gfritzsche)
Attachment #8886150 - Flags: review?(gfritzsche)
Comment on attachment 8886150 [details] [diff] [review]
docsHealthPing.patch

Review of attachment 8886150 [details] [diff] [review]:
-----------------------------------------------------------------

This looks pretty good!

I mostly have small nits below.
The only thing that seems to be missing is a section that clearly details when health pings are expected to be sent.

::: toolkit/components/telemetry/docs/data/health-ping.rst
@@ +3,5 @@
> +============
> +
> +This ping is intended to provide data about problems arose when submitting other pings.
> +Data is submitted not often then one hour. On shutdown additional ping in submitted
> +to avoid loosing collected data.

"The ping submitted at most once per hour."
"On shutdown an additional ping is submitted ..."

@@ +5,5 @@
> +This ping is intended to provide data about problems arose when submitting other pings.
> +Data is submitted not often then one hour. On shutdown additional ping in submitted
> +to avoid loosing collected data.
> +
> +This ping is intended to be really small.

Add "The client id submitted with this ping."

@@ +11,5 @@
> +.. code-block:: js
> +
> +    {
> +      "type": "health", // type
> +      "id": <UUID>, // a UUID that identifies this ping

You can shorten this similar to crash-ping.rst with "// ... common ping data".

@@ +24,5 @@
> +        "os": {
> +            "name": <string>, // OS name
> +            "version": <string> // OS version
> +        },
> +        "reason": <string>, // When ping was triggered

"..., e.g. "immediate" or "shutdown".

@@ +26,5 @@
> +            "version": <string> // OS version
> +        },
> +        "reason": <string>, // When ping was triggered
> +        "pingDiscardedForSize": {
> +            "pingType1": <number>, // Amount of occurrences for specific ping type.

"... a specific ..."

@@ +31,5 @@
> +            ...
> +            "pingTypeN": <number>
> +        },
> +        "sendFailure": {
> +            "failureName1": <number>, // Amount of occurrences for specific failure.

"... for a specific ..."

@@ +38,5 @@
> +        }
> +      }
> +    }
> +
> +Field details

Before this, can you add a section on the send behavior?
I.e. the exact scenarios in which pings are triggered.
The "expected behaviors" section in the new-profile-ping.rst could serve as an example, or the breakdown at the top of main-ping.rst.

@@ +43,5 @@
> +-------------
> +
> +reason
> +~~~~~~
> +The ``reason`` contains the information about when "health" ping was submitted. Now it supports three types:

"The ``reason`` field ..." ?
Similar for the other field descriptions below.

@@ +52,5 @@
> +
> +pingDiscardedForSize
> +~~~~~~~~~~~~~~~~~~~~
> +The ``pingDiscardedForSize`` contains the information about top ten pings, which size was more than
> +allowable (1 mb). Information include type of ping and number occurrences of over sized ping of this type.

"... whose size exceeded the ping size limit" ?
"This fields lists the number of discarded pings per ping type." ?

@@ +54,5 @@
> +~~~~~~~~~~~~~~~~~~~~
> +The ``pingDiscardedForSize`` contains the information about top ten pings, which size was more than
> +allowable (1 mb). Information include type of ping and number occurrences of over sized ping of this type.
> +
> +It is optional.

"This field is optional." ?

@@ +61,5 @@
> +~~~~~~~~~~~
> +The ``sendFailure`` contains the information about pings, which had failures on sending.
> +Information include type of failure and number occurrences of failed to send ping.
> +
> +It is optional.

"This field is optional." ?

@@ +66,5 @@
> +
> +.. note::
> +
> +    Although both ``pingDiscardedForSize`` and ``sendFailure`` fields are optional, health ping could be
> +    submitted if and only if one of this field not empty.

"..., the health ping will only be submitted if ..." ?
Attachment #8886150 - Flags: review?(gfritzsche) → feedback+
Also, please set a commit message with the same form.
With multiple patches on a bug you could also use the "Bug xxx - Part N - Foo bar." form.
Attached patch docsHealthPing.patch (obsolete) — Splinter Review
Attachment #8886150 - Attachment is obsolete: true
Attachment #8886247 - Flags: review?(gfritzsche)
Comment on attachment 8886247 [details] [diff] [review]
docsHealthPing.patch

Review of attachment 8886247 [details] [diff] [review]:
-----------------------------------------------------------------

Mostly i would like us to be more detailed about the send behavior, to make this documentation more useful to our users.

::: toolkit/components/telemetry/docs/data/health-ping.rst
@@ +2,5 @@
> +"health" ping
> +============
> +
> +This ping is intended to provide data about problems arose when submitting other pings.
> +The ping submitted at most once per hour. On shutdown an additional ping is submitted

"The ping is submitted"

@@ +15,5 @@
> +      "type": "health", // type
> +      ... common ping data
> +      "clientId": <UUID>, // client id, e.g.
> +                            // "c641eacf-c30c-4171-b403-f077724e848a"
> +      "application": { ... },

Application is part of the common ping data, you can leave it out here.

@@ +44,5 @@
> +* The size of other assembled ping exceed the ping limit.
> +* There was a failure while sending other ping.
> +
> +After recording the data ping will be triggered immediately, or after a timeout, if it pass less then a hour before
> +previous submission. Ping would be triggered on shutdown too, if recorded data is not empty.

Can we make this a clear list of triggering scenarios?
- sent immediately with reason ... when ...
- every x min with reason ... when ...
- on shutdown ...

While this is similar to the reason list below, i would like to see more details for the triggering reasons here.
The background here is that data engineers and analysts should be able to work just from this documentation, without reading our code or asking us.

@@ +60,5 @@
> +
> +pingDiscardedForSize
> +~~~~~~~~~~~~~~~~~~~~
> +The ``pingDiscardedForSize`` field contains the information about top ten pings, whose size exceeded the
> +ping size limit (1 mb). This fields lists the number of discarded pings per ping type.

"This field lists ..."

@@ +67,5 @@
> +
> +sendFailure
> +~~~~~~~~~~~
> +The ``sendFailure`` field contains the information about pings, which had failures on sending.
> +This fields lists the number of failed pings per ping send failure type.

"This field lists ..."
Attachment #8886247 - Flags: review?(gfritzsche) → feedback+
Attached patch docsHealthPing.patch (obsolete) — Splinter Review
Attachment #8886247 - Attachment is obsolete: true
Attachment #8886541 - Flags: review?(gfritzsche)
Comment on attachment 8886541 [details] [diff] [review]
docsHealthPing.patch

Review of attachment 8886541 [details] [diff] [review]:
-----------------------------------------------------------------

That looks more clear, thanks!

Next step:
Getting data collection review on this doc patch: https://wiki.mozilla.org/Firefox/Data_Collection
You can flag for feedback of any of the listed peers.
Attachment #8886541 - Flags: review?(gfritzsche) → review+
Hi François, 
could you, please, make a data collection review for docsHealthPing.patch?
Attachment #8886541 - Flags: review+ → review?(francois)
Comment on attachment 8886541 [details] [diff] [review]
docsHealthPing.patch

Hi Chenxia,

Could you please make data review for docsHealthPing.patch?
Attachment #8886541 - Flags: review?(francois) → review?(liuche)
Will do! I'll take a look.
Comment on attachment 8886541 [details] [diff] [review]
docsHealthPing.patch

Review of attachment 8886541 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good to me, it's clearly Type 1 Technical Data ( https://wiki.mozilla.org/Firefox/Data_Collection ) and this doc is detailed in when and what it is sending. data-review+

Thanks for your patience and sorry for the delay!

::: toolkit/components/telemetry/docs/data/health-ping.rst
@@ +1,5 @@
> +
> +"health" ping
> +============
> +
> +This ping is intended to provide data about problems arose when submitting other pings.

typo: "problems that arise when"

@@ +3,5 @@
> +============
> +
> +This ping is intended to provide data about problems arose when submitting other pings.
> +The ping is submitted at most once per hour. On shutdown an additional ping is submitted
> +to avoid loosing collected data.

typo: losing

@@ +12,5 @@
> +.. code-block:: js
> +
> +    {
> +      "type": "health", // type
> +      ... common ping data

Can you elaborate on what "common ping data" is? If this is coreping, that's clear, or maybe you could link to where that's defined.

@@ +22,5 @@
> +            "version": <string> // OS version
> +        },
> +        "reason": <string>, // When ping was triggered, e.g. "immediate" or "shutdown".
> +        "pingDiscardedForSize": {
> +            "pingType1": <number>, // Amount of occurrences for a specific ping type.

Optional: give examples of ping types, or link to what kinds of ping types?
Attachment #8886541 - Flags: review?(liuche) → review+
Attached patch docsHealthPing.patch (obsolete) — Splinter Review
Attachment #8886541 - Attachment is obsolete: true
Attachment #8889400 - Attachment is obsolete: true
(In reply to Chenxia Liu [:liuche] from comment #62)
> Comment on attachment 8886541 [details] [diff] [review]
> docsHealthPing.patch
> 
> Review of attachment 8886541 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks good to me, it's clearly Type 1 Technical Data (
> https://wiki.mozilla.org/Firefox/Data_Collection ) and this doc is detailed
> in when and what it is sending. data-review+
> 
> Thanks for your patience and sorry for the delay!
> 
> ::: toolkit/components/telemetry/docs/data/health-ping.rst
> @@ +1,5 @@
> > +
> > +"health" ping
> > +============
> > +
> > +This ping is intended to provide data about problems arose when submitting other pings.
> 
> typo: "problems that arise when"
> 
> @@ +3,5 @@
> > +============
> > +
> > +This ping is intended to provide data about problems arose when submitting other pings.
> > +The ping is submitted at most once per hour. On shutdown an additional ping is submitted
> > +to avoid loosing collected data.
> 
> typo: losing
> 
> @@ +12,5 @@
> > +.. code-block:: js
> > +
> > +    {
> > +      "type": "health", // type
> > +      ... common ping data
> 
> Can you elaborate on what "common ping data" is? If this is coreping, that's
> clear, or maybe you could link to where that's defined.
> 
> @@ +22,5 @@
> > +            "version": <string> // OS version
> > +        },
> > +        "reason": <string>, // When ping was triggered, e.g. "immediate" or "shutdown".
> > +        "pingDiscardedForSize": {
> > +            "pingType1": <number>, // Amount of occurrences for a specific ping type.
> 
> Optional: give examples of ping types, or link to what kinds of ping types?

Hi Chenxia! Thank you for the review.
About common data, it is refer to common ping format http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/common-ping.html, this done in other places, e.g. http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/crash-ping.html
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2ceb8a5a3e40
Support initial implementation for Health ping. r=gfritzsche
https://hg.mozilla.org/integration/mozilla-inbound/rev/882a43e7e21a
Add docs for Health ping. r=liuche
Keywords: checkin-needed
backed out for failures like https://treeherder.mozilla.org/logviewer.html#?job_id=116972085&repo=mozilla-inbound
Flags: needinfo?(kustiuzhanina)
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3538dbc8dbfe
Backed out changeset 882a43e7e21a 
https://hg.mozilla.org/integration/mozilla-inbound/rev/a035857a0533
Backed out changeset 2ceb8a5a3e40 for xpcshell failures in test_TelemetryHealthPing.js
Attached patch healthPing.patch (obsolete) — Splinter Review
Attachment #8886138 - Attachment is obsolete: true
Flags: needinfo?(kustiuzhanina)
Depends on: 1383807
Attached patch healthPing.patch (obsolete) — Splinter Review
Attachment #8889488 - Attachment is obsolete: true
Attached patch healthPing.patchSplinter Review
Attachment #8889820 - Attachment is obsolete: true
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/d260dd770741
https://hg.mozilla.org/mozilla-central/rev/03d1cedf6f12
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Depends on: 1385417
You need to log in before you can comment on or make changes to this bug.