Closed Bug 1460595 Opened 6 years ago Closed 6 years ago

Send an "event" ping containing Telemetry Events

Categories

(Toolkit :: Telemetry, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
mozilla63
Tracking Status
firefox62 + fixed
firefox63 --- fixed

People

(Reporter: chutten, Assigned: chutten)

References

Details

Attachments

(8 files)

This shall be the bug for the work implementing the design being worked on in bug 1442653.

It will require, at the minimum, its own module, some timers, a lot of tests, and changes as the last changes to the design come in.
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Priority: -- → P1
ni?janerik for feedback on these patches.
Flags: needinfo?(jrediger)
Comment on attachment 8976129 [details]
bug 1460595 - Add Event Ping preferences to TelemetryUtils

https://reviewboard.mozilla.org/r/244318/#review250290
Attachment #8976129 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8976131 [details]
bug 1460595 - Implement the 'event' ping

https://reviewboard.mozilla.org/r/244322/#review250336

This looks good overall! It's mostly nits and the suggestion to use the nice stuff from `XPCOMUtils`. With that, the code will look much cleaner!

::: toolkit/components/telemetry/TelemetryController.jsm:747
(Diff revision 1)
>          await this._delayedNewPingTask.finalize();
>        }
>  
>        UpdatePing.shutdown();
>  
> +      TelemetryHealthPing.shutdown();

I think you mean `TelemetryEventPing.shutdown();` here

::: toolkit/components/telemetry/TelemetryEventPing.jsm:31
(Diff revision 1)
> +});
> +
> +ChromeUtils.defineModuleGetter(this, "setTimeout", "resource://gre/modules/Timer.jsm");
> +ChromeUtils.defineModuleGetter(this, "clearTimeout", "resource://gre/modules/Timer.jsm");
> +ChromeUtils.defineModuleGetter(this, "TelemetryUtils", "resource://gre/modules/TelemetryUtils.jsm");
> +ChromeUtils.defineModuleGetter(this, "Services", "resource://gre/modules/Services.jsm");

Should we make this "lazy" (i.e. add it to the list on line 18)? Looks like this only gets used after late telemetry startup.

Or better, any reason not to make them all "lazy"?

::: toolkit/components/telemetry/TelemetryEventPing.jsm:38
(Diff revision 1)
> +const Utils = TelemetryUtils;
> +
> +const MS_IN_A_MINUTE = 60 * 1000;
> +
> +const DEFAULT_EVENT_LIMIT = 1000;
> +const DEFAULT_MIN_FREQUENCY = 60 * MS_IN_A_MINUTE;

nit: I think we usually postfix these constants with the unit we use, so this and the one below should become `DEFAULT_MIN_FREQUENCY_MS`. I have no strong opinion on this though, feel free to ignore :)

::: toolkit/components/telemetry/TelemetryEventPing.jsm:39
(Diff revision 1)
> +
> +const MS_IN_A_MINUTE = 60 * 1000;
> +
> +const DEFAULT_EVENT_LIMIT = 1000;
> +const DEFAULT_MIN_FREQUENCY = 60 * MS_IN_A_MINUTE;
> +const DEFAULT_MAX_FREQUENCY = 10 * MS_IN_A_MINUTE;

super-nit: is this really a `frequency`?

::: toolkit/components/telemetry/TelemetryEventPing.jsm:75
(Diff revision 1)
> +  _lastSendTime: -DEFAULT_MIN_FREQUENCY * MS_IN_A_MINUTE,
> +
> +  _maxEventsPerPing: undefined,
> +  get maxEventsPerPing() {
> +    if (this._maxEventsPerPing === undefined) {
> +      Services.prefs.addObserver(Utils.Preferences.EventPingEventLimit, this);

What if I told... that we don't need this anymore? We can rely on `XPCOMUtils.defineLazyPreferenceGetter` [see here](https://searchfox.org/mozilla-central/rev/00dd116638001772fe354b081353b73f1cad405d/js/xpconnect/loader/XPCOMUtils.jsm#377).

::: toolkit/components/telemetry/TelemetryEventPing.jsm:85
(Diff revision 1)
> +    return this._maxEventsPerPing;
> +  },
> +
> +  _minFrequency: undefined,
> +  get minFrequency() {
> +    if (this._minFrequency === undefined) {

See the `XPCOMUtils.defineLazyPreferenceGetter` comment.

::: toolkit/components/telemetry/TelemetryEventPing.jsm:97
(Diff revision 1)
> +  },
> +
> +  _maxFrequency: undefined,
> +  get maxFrequency() {
> +    if (this._maxFrequency === undefined) {
> +      Services.prefs.addObserver(Utils.Preferences.EventPingMaximumFrequency, this);

As above, see `XPCOMUtils.defineLazyPreferenceGetter`

::: toolkit/components/telemetry/TelemetryEventPing.jsm:106
(Diff revision 1)
> +
> +    return this._maxFrequency;
> +  },
> +
> +  get dataset() {
> +    return Telemetry.canRecordExtended ? Ci.nsITelemetry.DATASET_RELEASE_CHANNEL_OPTIN

nit: should we use `Telemetry.canRecordPrerelease` here instead? We'll change this anyway, eventually.

::: toolkit/components/telemetry/TelemetryEventPing.jsm:118
(Diff revision 1)
> +
> +    this._startTimer();
> +  },
> +
> +  shutdown() {
> +    Services.obs.removeObserver(this, EVENT_LIMIT_REACHED_TOPIC);

This and the one below throw an exception if, for some reason, we [failed to register](https://searchfox.org/mozilla-central/rev/00dd116638001772fe354b081353b73f1cad405d/xpcom/ds/nsObserverService.cpp#251). Should we protect against that? Otherwise we may break the shutdown process in TelemetryController

::: toolkit/components/telemetry/TelemetryEventPing.jsm:120
(Diff revision 1)
> +  },
> +
> +  shutdown() {
> +    Services.obs.removeObserver(this, EVENT_LIMIT_REACHED_TOPIC);
> +    Services.obs.removeObserver(this, PROFILE_BEFORE_CHANGE_TOPIC);
> +    Services.prefs.removeObserver(Utils.Preferences.EventPingEventLimit, this);

If we use the `XPCOMUtils` preferences goodnes, we can get rid of these!

::: toolkit/components/telemetry/TelemetryEventPing.jsm:123
(Diff revision 1)
> +    Services.obs.removeObserver(this, EVENT_LIMIT_REACHED_TOPIC);
> +    Services.obs.removeObserver(this, PROFILE_BEFORE_CHANGE_TOPIC);
> +    Services.prefs.removeObserver(Utils.Preferences.EventPingEventLimit, this);
> +    Services.prefs.removeObserver(Utils.Preferences.EventPingMinimumFrequency, this);
> +    Services.prefs.removeObserver(Utils.Preferences.EventPingMaximumFrequency, this);
> +  },

Do we need to call `clearTimer` here?

::: toolkit/components/telemetry/TelemetryEventPing.jsm:144
(Diff revision 1)
> +        break;
> +      case PROFILE_BEFORE_CHANGE_TOPIC:
> +        this._log.trace("profile before change");
> +        this._submitPing(this.Reason.SHUTDOWN, true /* discardLeftovers */);
> +        break;
> +      case PREF_CHANGED_TOPIC:

We can get rid of this when using the `XPCOMUtils` stuff.

::: toolkit/components/telemetry/TelemetryEventPing.jsm:177
(Diff revision 1)
> +      Policy.clearTimeout(this._timeoutId);
> +      this._timeoutId = null;
> +    }
> +  },
> +
> +  _submitPing(reason, discardLeftovers = false) {

nit: should we add a comment roughly explaining what is this function doing?
Attachment #8976131 - Flags: review?(alessio.placitelli)
Comment on attachment 8976132 [details]
bug 1460595 - Add meta-telemetry to record how many event pings we're sending

https://reviewboard.mozilla.org/r/244324/#review250358
Attachment #8976132 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8976128 [details]
bug 1460595 - Change storage semantics Telemetry Events

https://reviewboard.mozilla.org/r/244316/#review250364

This looks good overall, I think we need to make a decision on the name/semantics here: are we exposing a generic "second event storage", or is it a subsession/session storage? Depending on that, we should change docstrings and parameters accordingly. I'm fine with whatever we go with, as long as it's consistent throughout the changeset.

::: commit-message-09f04:3
(Diff revision 1)
> +bug 1460595 - add a second storage for Telemetry Events r?Dexter
> +
> +MozReview-Commit-ID: EY40tqKxxeW

nit: would you kindly add a few lines to explain why we're adding the second storage?

::: toolkit/components/telemetry/Telemetry.cpp:1799
(Diff revision 1)
>  {
>    return TelemetryEvent::RecordEvent(aCategory, aMethod, aObject, aValue, aExtra, aCx, optional_argc);
>  }
>  
>  NS_IMETHODIMP
> -TelemetryImpl::SnapshotEvents(uint32_t aDataset, bool aClear, JSContext* aCx,
> +TelemetryImpl::SnapshotEvents(uint32_t aDataset, bool aClear, bool aStorageTwo, JSContext* aCx,

Instead of `aStorageTwo`, should we be consistent and call this `aSubsessionStorage`?

::: toolkit/components/telemetry/TelemetryEvent.cpp:331
(Diff revision 1)
>  typedef nsUint32HashKey ProcessIDHashKey;
>  typedef nsTArray<EventRecord> EventRecordArray;
>  typedef nsClassHashtable<ProcessIDHashKey, EventRecordArray> EventRecordsMapType;
>  
>  EventRecordsMapType gEventRecords;
> +EventRecordsMapType gSubsessionEventRecords;

Can we add a comment explaining why we need the second storage?
Attachment #8976128 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8976130 [details]
bug 1460595 - Test "event" ping event storage

https://reviewboard.mozilla.org/r/244320/#review250390
Attachment #8976130 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8976133 [details]
bug 1460595 - Test the 'event' ping

https://reviewboard.mozilla.org/r/244326/#review250392

::: toolkit/components/telemetry/tests/unit/test_TelemetryEventPing.js:34
(Diff revision 1)
> +  mod.Policy.setTimeout = set;
> +  mod.Policy.clearTimeout = clear;
> +  mod.Policy.sendPing = send;
> +}
> +
> +function pass() {}

nit: Can we add a comment line here stating that this is intentionally left empty?

::: toolkit/components/telemetry/tests/unit/test_TelemetryEventPing.js:52
(Diff revision 1)
> +  do_get_profile(true);
> +  // Make sure we don't generate unexpected pings due to pref changes.
> +  await setEmptyPrefWatchlist();
> +
> +  await TelemetryController.testSetup();
> +  TelemetryEventPing.testReset();

Do you need this line? Isn't this implied in the previous `TelemetryController.testSetup();`?
Attachment #8976133 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8976127 [details]
bug 1460595 - Document the event ping

https://reviewboard.mozilla.org/r/244314/#review250394

This looks clear! Can we also update [the internal API docs](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/collection/events.html#internal-api) for the second/session storage?
Attachment #8976127 - Flags: review?(alessio.placitelli)
Comment on attachment 8976128 [details]
bug 1460595 - Change storage semantics Telemetry Events

https://reviewboard.mozilla.org/r/244316/#review250364

> Instead of `aStorageTwo`, should we be consistent and call this `aSubsessionStorage`?

But then it has to default to "true", which I'm not really keen on from an API design front.

I think I'll standardize on "storage two", despite it having approximately zero nutritional value.
Comment on attachment 8976131 [details]
bug 1460595 - Implement the 'event' ping

https://reviewboard.mozilla.org/r/244322/#review250336

> Should we make this "lazy" (i.e. add it to the list on line 18)? Looks like this only gets used after late telemetry startup.
> 
> Or better, any reason not to make them all "lazy"?

I presume TelemetryHealthPing will only be used lazily so the "lazy vs import" decision has to be made based on how close to module use these submodules will be used. Services, setTimeout, and Utils are all used ~immediately to set up the first timer. clearTimeout isn't, but as we're already importing Timer.jsm anyway it shouldn't save us anything.

We could make them all lazy, but I know Timer, Utils, and Services will be used immediately on startup so we may as well save the cycles.

> super-nit: is this really a `frequency`?

I'm trying to highlight through word choice the periodic nature of the code. If you want to bikeshed a different name that's fine, so long as you make a case that it's better :)

> What if I told... that we don't need this anymore? We can rely on `XPCOMUtils.defineLazyPreferenceGetter` [see here](https://searchfox.org/mozilla-central/rev/00dd116638001772fe354b081353b73f1cad405d/js/xpconnect/loader/XPCOMUtils.jsm#377).

I would say "Thank you, I will get rid of this repetitive code and use the utility functions straightaway"
Comment on attachment 8976133 [details]
bug 1460595 - Test the 'event' ping

https://reviewboard.mozilla.org/r/244326/#review250392

> Do you need this line? Isn't this implied in the previous `TelemetryController.testSetup();`?

It's what sets TelemetryEventPing._testing to true. All the individual tests currently also call testReset so we would be fine without it, but I feel safer with it there.
Comment on attachment 8976127 [details]
bug 1460595 - Document the event ping

https://reviewboard.mozilla.org/r/244314/#review250596
Attachment #8976127 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8976131 [details]
bug 1460595 - Implement the 'event' ping

https://reviewboard.mozilla.org/r/244322/#review250606
Attachment #8976131 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8976131 [details]
bug 1460595 - Implement the 'event' ping

https://reviewboard.mozilla.org/r/244322/#review250336

> I presume TelemetryHealthPing will only be used lazily so the "lazy vs import" decision has to be made based on how close to module use these submodules will be used. Services, setTimeout, and Utils are all used ~immediately to set up the first timer. clearTimeout isn't, but as we're already importing Timer.jsm anyway it shouldn't save us anything.
> 
> We could make them all lazy, but I know Timer, Utils, and Services will be used immediately on startup so we may as well save the cycles.

Ok, sounds good to me!

> I'm trying to highlight through word choice the periodic nature of the code. If you want to bikeshed a different name that's fine, so long as you make a case that it's better :)

Sorry, no better proposal here, so let's drop this :)
Comment on attachment 8976127 [details]
bug 1460595 - Document the event ping

https://reviewboard.mozilla.org/r/244314/#review250618

Leading a patch set with the docs. I like that!
Couple of nits, otherwise good to go.

::: toolkit/components/telemetry/docs/data/event-ping.rst:43
(Diff revision 2)
> +to avoid losing collected data.
> +
> +On shutdown, during profile-before-change, a final ping is sent with any remaining event
> +records, regardless of frequency but obeying the event record limit.
> +
> +The 1000-record limit and one-hour and ten-minute frequenecies are controlled by

typo. should be "frequencies"

::: toolkit/components/telemetry/docs/data/event-ping.rst:53
(Diff revision 2)
> +
> +reason
> +~~~~~~
> +The ``reason`` field contains the information about why the "event" ping was submitted:
> +
> +* periodic: The event ping was submitted because an event happened in the past hour.

Make it clear that this is a minimum requirement:

The event ping was submitted because at least one event happened in the past hour.

::: toolkit/components/telemetry/docs/data/event-ping.rst:80
(Diff revision 2)
> +  This may not be the same subsession that the events occurred in if a
> +  :ref:`session split <sessionsplit>` happened in between.
> +
> +lostEventsCount
> +~~~~~~~~~~~~~~~
> +The number of events we had to discard because we reached out 1000-per-ping limit before

typo. should be "because we reached our 1000-per-ping limit" or "because we reached the 1000-per-ping limit"
Attachment #8976127 - Flags: review+
Comment on attachment 8976128 [details]
bug 1460595 - Change storage semantics Telemetry Events

https://reviewboard.mozilla.org/r/244316/#review250620

I agree with :Dexter here. The naming is not really clear.

::: toolkit/components/telemetry/nsITelemetry.idl:483
(Diff revision 2)
>     *     [43258, "category1", "method2", "object1", null, {"key1": "string value"}],
>     *     ...
>     *   ]
>     *
>     * @param aDataset DATASET_RELEASE_CHANNEL_OPTOUT or DATASET_RELEASE_CHANNEL_OPTIN.
> -   * @param [aClear=false] Whether to clear out the scalars after snapshotting.
> +   * @param [aClear=false] Whether to clear out the events after snapshotting.

Good catch.

::: toolkit/components/telemetry/nsITelemetry.idl:484
(Diff revision 2)
>     *     ...
>     *   ]
>     *
>     * @param aDataset DATASET_RELEASE_CHANNEL_OPTOUT or DATASET_RELEASE_CHANNEL_OPTIN.
> -   * @param [aClear=false] Whether to clear out the scalars after snapshotting.
> +   * @param [aClear=false] Whether to clear out the events after snapshotting.
> +   * @param [aStorageTwo=false] Whether to use storage one or two to snapshot and optionally clear the event storage.

From the argument name and the documentation it's not clear what the different storages are for. We should clarify that.

In addition it's not clear to me how this argument affects the storage clearing.
Comment on attachment 8976133 [details]
bug 1460595 - Test the 'event' ping

https://reviewboard.mozilla.org/r/244326/#review250628

::: toolkit/components/telemetry/tests/unit/test_TelemetryEventPing.js:62
(Diff revision 2)
> +  Telemetry.clearEvents();
> +  TelemetryEventPing.testReset();
> +
> +  fakePolicy(pass, pass, fail);
> +  recordEvents(999);
> +  fakePolicy((callback, delay) => {

Nested callbacks replacing the policy itself, that's a tough one.

Took me a while to understand the order the events are recorded in. 
Especially the fact that on the 1000s event the submission is triggered immediately and the `_startTimer` for periodic triggers the first mocked callback. 
Can we clarify this with a comment?

::: toolkit/components/telemetry/tests/unit/test_TelemetryEventPing.js:72
(Diff revision 2)
> +      Assert.ok(options.addEnvironment, "Adds the environment.");
> +      Assert.ok(!options.usePingSender, "Doesn't require pingsender.");
> +      Assert.equal(payload.reason, TelemetryEventPing.Reason.MAX, "Sending because we hit max");
> +      Assert.equal(payload.events.parent.length, 1000, "Has one thousand events");
> +      Assert.equal(payload.lostEventsCount, 0, "Lost no events");
> +      Assert.ok(!payload.events.parent.some(ev => ev[1] === "test2"), "Should not have included any the final event (yet).");

nit. no "any" needed here.

::: toolkit/components/telemetry/tests/unit/test_TelemetryEventPing.js:86
(Diff revision 2)
> +  recordEvents(999);
> +  fakePolicy((callback, delay) => {
> +    Telemetry.recordEvent("telemetry.test", "test2", "object2");
> +    Telemetry.recordEvent("telemetry.test", "test2", "object2");
> +    fakePolicy(pass, pass, (type, payload, options) => {
> +      checkPingStructure(type, payload, options);

This callback is never actually reached because of a JavaScript error being triggered:

JavaScript error: resource://gre/modules/TelemetryEventPing.jsm, line 167: TypeError: this._leftovers.concat is not a function

Probably something like this in TelemetryEventPing.jsm, line 163:

```patch
-        events = this._leftovers.concat(events);
+        events = this._leftovers[process].concat(events);
```

::: toolkit/components/telemetry/tests/unit/test_TelemetryEventPing.js:93
(Diff revision 2)
> +      Assert.ok(options.addEnvironment, "Adds the environment.");
> +      Assert.ok(!options.usePingSender, "Doesn't require pingsender.");
> +      Assert.equal(payload.reason, TelemetryEventPing.Reason.MAX, "Sending because we hit max");
> +      Assert.equal(payload.events.parent.length, 1000, "Has one thousand events");
> +      Assert.equal(payload.lostEventsCount, 3, "Lost three events");
> +      Assert.equal(payload.events.parent[0][1], "test2", "The first event of the second bunch should be the leftover event of the first bunch.");

This needs to be `payload.events.parent[0][2]`
Attachment #8976133 - Flags: review-
Comment on attachment 8976131 [details]
bug 1460595 - Implement the 'event' ping

https://reviewboard.mozilla.org/r/244322/#review250654

::: toolkit/components/telemetry/TelemetryEventPing.jsm:163
(Diff revision 2)
> +    }
> +
> +    let eventCount = 0;
> +    for (let [process, events] of Object.entries(snapshot)) {
> +      if (process in this._leftovers) {
> +        events = this._leftovers.concat(events);

This should be:

```patch
-        events = this._leftovers.concat(events);
+        events = this._leftovers[process].concat(events);
```

::: toolkit/components/telemetry/TelemetryEventPing.jsm:168
(Diff revision 2)
> +        events = this._leftovers.concat(events);
> +        delete this._leftovers[process];
> +      }
> +      eventCount += events.length;
> +      if (events.length > this.maxEventsPerPing) {
> +        let processLeftovers = events.slice(this.maxEventsPerPing);

We also need to slice off the excess events here:

```
events = events.slice(0, this.maxEventsPerPing);
```

::: toolkit/components/telemetry/TelemetryEventPing.jsm:175
(Diff revision 2)
> +          this._leftovers[process] = this._leftovers[process].concat(processLeftovers);
> +        } else {
> +          this._leftovers[process] = processLeftovers;
> +        }
> +        snapshot[process].length = this.maxEventsPerPing;
> +      }

You need to store the events back in the snapshot, otherwise you only modified a copy of the data:

```
snapshot[process] = events;
```
Attachment #8976131 - Flags: review-
Comment on attachment 8976132 [details]
bug 1460595 - Add meta-telemetry to record how many event pings we're sending

https://reviewboard.mozilla.org/r/244324/#review250656
Attachment #8976132 - Flags: review+
Comment on attachment 8976128 [details]
bug 1460595 - Change storage semantics Telemetry Events

https://reviewboard.mozilla.org/r/244316/#review250658
Attachment #8976128 - Flags: review+
Comment on attachment 8976129 [details]
bug 1460595 - Add Event Ping preferences to TelemetryUtils

https://reviewboard.mozilla.org/r/244318/#review250660
Attachment #8976129 - Flags: review+
Comment on attachment 8976130 [details]
bug 1460595 - Test "event" ping event storage

https://reviewboard.mozilla.org/r/244320/#review250662
Attachment #8976130 - Flags: review+
Flags: needinfo?(jrediger)
Comment on attachment 8976131 [details]
bug 1460595 - Implement the 'event' ping

https://reviewboard.mozilla.org/r/244322/#review250664

Cheers for getting this on track!
I just did a curious high-level readover here.

::: toolkit/components/telemetry/TelemetryEventPing.jsm:143
(Diff revision 2)
> +
> +  /**
> +   * Submits an "event" ping and restarts the timer for the next interval.
> +   *
> +   * @param {String} reason The reason we're sending the ping. One of TelemetryEventPing.Reason.
> +   * @param {bool} discardLefotevers Whether to discard event records left over from a previous ping.

Typo: Leftovers

::: toolkit/components/telemetry/TelemetryEventPing.jsm:163
(Diff revision 2)
> +    }
> +
> +    let eventCount = 0;
> +    for (let [process, events] of Object.entries(snapshot)) {
> +      if (process in this._leftovers) {
> +        events = this._leftovers.concat(events);

Looking at the `_leftovers` handling here i wonder:
Could we just ask `snapshotEvents()` to give us "at most N events"?
Comment on attachment 8976127 [details]
bug 1460595 - Document the event ping

https://reviewboard.mozilla.org/r/244314/#review250678

Yay for good and clear documentation!
This covers the semantics and contents well.

::: toolkit/components/telemetry/docs/collection/events.rst:239
(Diff revision 2)
>  Internal API
>  ------------
>  
>  .. code-block:: js
>  
> -  Services.telemetry.snapshotEvents(dataset, clear);
> +  Services.telemetry.snapshotEvents(dataset, clear, storageTwo);

What is `storageTwo`?
Maybe this should just be `storage`?

::: toolkit/components/telemetry/docs/data/event-ping.rst:13
(Diff revision 2)
> +The :doc:`Telemetry Environment <../data/environment>` is submitted in this ping.
> +
> +.. code-block:: js
> +
> +    {
> +      "type": "event",

We should settle the "event" vs. "events" question.

::: toolkit/components/telemetry/docs/data/event-ping.rst:20
(Diff revision 2)
> +      "clientId": <UUID>,
> +      "environment": { ... },
> +      "payload": {
> +        "reason": {periodic, max, shutdown}, // Why the ping was submitted
> +        "sessionStartDate": <ISO Timestamp>, // Hourly precision, for calculating absolute time across pings, same as in the "main" ping
> +        "subsessionId": <UUID>, // For linking to "main" pings

Add `sessionId`?

::: toolkit/components/telemetry/docs/data/event-ping.rst:23
(Diff revision 2)
> +        "reason": {periodic, max, shutdown}, // Why the ping was submitted
> +        "sessionStartDate": <ISO Timestamp>, // Hourly precision, for calculating absolute time across pings, same as in the "main" ping
> +        "subsessionId": <UUID>, // For linking to "main" pings
> +        "lostEventsCount": <number>, // How many events we had to drop. Valid only for reasons "max" and "shutdown"
> +        "events": {
> +          "process1": [

Nit: use a recognizable process name like "main" and add a comment on what this key is for?

::: toolkit/components/telemetry/docs/data/event-ping.rst:53
(Diff revision 2)
> +
> +reason
> +~~~~~~
> +The ``reason`` field contains the information about why the "event" ping was submitted:
> +
> +* periodic: The event ping was submitted because an event happened in the past hour.

Nit: make these standout as keys, e.g. by using "" or making them code-style?

::: toolkit/components/telemetry/docs/internals/preferences.rst:145
(Diff revision 2)
>  ``toolkit.telemetry.maxEventSummaryKeys``
>  
>    Set the maximum number of keys per process of the :ref:`Event Summary <events.event-summary>`
>    :ref:`keyed scalars <scalars.keyed-scalars>`. Default is 500. Change requires restart.
>  
> +``toolkit.telemetry.event.eventLimit``

Should these be `*.eventping.*` for clarity?
I'm not sure if `*.event.*` would tell me much if i was just scanning about:config.
Comment on attachment 8976128 [details]
bug 1460595 - Change storage semantics Telemetry Events

https://reviewboard.mozilla.org/r/244316/#review250684

::: toolkit/components/telemetry/Telemetry.cpp:1799
(Diff revision 2)
>  {
>    return TelemetryEvent::RecordEvent(aCategory, aMethod, aObject, aValue, aExtra, aCx, optional_argc);
>  }
>  
>  NS_IMETHODIMP
> -TelemetryImpl::SnapshotEvents(uint32_t aDataset, bool aClear, JSContext* aCx,
> +TelemetryImpl::SnapshotEvents(uint32_t aDataset, bool aClear, bool aStorageTwo, JSContext* aCx,

Can we choose some more descriptive name than `storageTwo`?
Finding out what which storage is for would require a lot of digging.

::: toolkit/components/telemetry/TelemetryEvent.cpp:336
(Diff revision 2)
> +EventRecordsMapType gEventRecordsTwo;
> +EventRecordsMapType gBuiltinEventRecordsTwo;

Please use more descriptive names here and below.

Or make the storage an actual container of size two and use clearly labeled local variables that document the usage.
Blocks: 1463410
Blocks: 1463439
Comment on attachment 8976131 [details]
bug 1460595 - Implement the 'event' ping

https://reviewboard.mozilla.org/r/244322/#review250664

> Looking at the `_leftovers` handling here i wonder:
> Could we just ask `snapshotEvents()` to give us "at most N events"?

Technically it'll be "at most N events -per process-"... it ought to be more straightforward code to read at the cost of yet another optional argument to snapshotEvents. I'll look into whether the cost balances out.
Comment on attachment 8976127 [details]
bug 1460595 - Document the event ping

https://reviewboard.mozilla.org/r/244314/#review252152

::: toolkit/components/telemetry/docs/data/event-ping.rst:24
(Diff revisions 2 - 3)
> -        "sessionStartDate": <ISO Timestamp>, // Hourly precision, for calculating absolute time across pings, same as in the "main" ping
> +        "processCreationTimestamp": <UNIX Timestamp>, // Minute precision, for calculating absolute time across pings
> +        "sessionId": <UUID>, // For linking to "main" pings
>          "subsessionId": <UUID>, // For linking to "main" pings
>          "lostEventsCount": <number>, // How many events we had to drop. Valid only for reasons "max" and "shutdown"
>          "events": {
> -          "process1": [
> +          "main": [ // process name, one of the keys from Process.yaml

`main` is the only one that's not actually in there. It should be `parent`.

Also: the file is called `Processes.yaml`
Attachment #8976127 - Flags: review+
Comment on attachment 8976128 [details]
bug 1460595 - Change storage semantics Telemetry Events

https://reviewboard.mozilla.org/r/244316/#review252166

That naming is much more descriptive. Good job!

::: toolkit/components/telemetry/TelemetryEvent.cpp:1222
(Diff revisions 2 - 3)
> -    if (aStorageTwo) {
> -      snapshotter(gEventRecordsTwo);
> -      snapshotter(gBuiltinEventRecordsTwo);
> +    if (aEventPingStorage) {
> +      snapshotter(gEventPingEventRecords);
> +      snapshotter(gEventPingBuiltinEventRecords);
>        if (aClear) {
> -        gEventRecordsTwo.Clear();
> -        gBuiltinEventRecordsTwo.Clear();
> +        gEventPingEventRecords.Clear();
> +        gEventPingBuiltinEventRecords.Clear();

You clear the events here and immediately fill them again (if there are leftovers).

Depending on the size of leftover events it might be more efficient to call `ClearAndRetainStorage` (and maybe call `Compact` after filling it).

::: toolkit/components/telemetry/TelemetryEvent.cpp:1226
(Diff revisions 2 - 3)
> -        gEventRecordsTwo.Clear();
> -        gBuiltinEventRecordsTwo.Clear();
> +        gEventPingEventRecords.Clear();
> +        gEventPingBuiltinEventRecords.Clear();
> +        for (auto pair : leftovers) {
> +          gEventPingEventRecords.Put(pair.first(),
> +                                     new EventRecordArray(pair.second()));
> +          gEventPingBuiltinEventRecords.Put(pair.first(),

Won't this duplicate the leftover events?
Comment on attachment 8976131 [details]
bug 1460595 - Implement the 'event' ping

https://reviewboard.mozilla.org/r/244322/#review252178

Moving out the leftover handling from this code definitely makes it easier to follow.
Attachment #8976131 - Flags: review+
Comment on attachment 8976128 [details]
bug 1460595 - Change storage semantics Telemetry Events

https://reviewboard.mozilla.org/r/244316/#review252166

> You clear the events here and immediately fill them again (if there are leftovers).
> 
> Depending on the size of leftover events it might be more efficient to call `ClearAndRetainStorage` (and maybe call `Compact` after filling it).

The size of the datahashtable is at most the number of processes, so I'm not as concerned about the hashtable as I am the EventRecordArrays. I can try to be clever here if you prefer, but I don't think it's worth the additional complexity.

> Won't this duplicate the leftover events?

Good grief, how did I let that slip. I was duplicating things and got a little carried away.

It should be snapshotting, clearing, refilling one at a time.
Comment on attachment 8976127 [details]
bug 1460595 - Document the event ping

https://reviewboard.mozilla.org/r/244314/#review252488

::: toolkit/components/telemetry/docs/collection/events.rst:239
(Diff revision 3)
>  Internal API
>  ------------
>  
>  .. code-block:: js
>  
> -  Services.telemetry.snapshotEvents(dataset, clear);
> +  Services.telemetry.snapshotEvents(dataset, clear, eventPingStorage);

nit: I think this should also mention the optional `aEventLimit`

::: toolkit/components/telemetry/docs/data/event-ping.rst:55
(Diff revision 3)
> +reason
> +~~~~~~
> +The ``reason`` field contains the information about why the "event" ping was submitted:
> +
> +* ``periodic``: The event ping was submitted because at least one event happened in the past hour.
> +* ``max``: The event ping was submitted because the 1000-record limit was reached.

nit: should we mention the `"event-telemetry-storage-limit-reached"` topic here or somewhere else? Maybe saying when this gets notified and that's for internal use only.
DATA COLLECTION REVIEW REQUEST:

    What questions will you answer with this data?

All the same questions that we answer with "main" ping events, but more quickly and permitting more events.

    Why does Mozilla need to answer these questions? Are there benefits for users? Do we need this information to address product or business requirements? Some example responses:

This is an extension of the existing Event Telemetry mechanism with its limitations (1000 event records per subsession, 1 ping per subsession) relieved (1000 event records per ping, max of 1 ping every 10min, min 1 every hour). All the original reasons and benefits for having Event Telemetry apply.

    What alternative methods did you consider to answer these questions? Why were they not sufficient?

Leaving the events on the "main" ping. The "main" ping is getting a little crowded, and is a little slow. Given plans for mass adoption of events (see project Savant and how devtools, normandy, and Activity Stream are adding events) we foresee a need for relaxing our limits.

    Can current instrumentation answer these questions?

Not really. We'll lose events if we keep to the "main" ping's limits. We can kinda-sorta figure out which events we lose by using Event Summary, but that loss of fidelity and ordering means we'll fail to provide full user session replay.

    List all proposed measurements and indicate the category of data collection for each measurement, using the Firefox data collection categories on the Mozilla wiki.

The ping contains events, Environment, clientID, and common ping data. In addition, it includes a minute-resolution timestamp of the main process start and explicit links to related "main" pings via subsessionID.

Combined or separately, this is Type 3. However, this is for the most part a reorganization.

The only -new- piece of information is the per-minute-resolution processCreatedTimestamp. It is at most Type 2, but it's the highest resolution timestamp we have associated with a session, so it may be worthy of discussion.

    How long will this data be collected? Choose one of the following:

Forever. Through Hindsight we will monitor its general behaviour, and individual event owners will monitor their events.

    What populations will you measure?

All

    Which release channels?

All

    Which countries?

All

    Which locales?

All

    Any other filters? Please describe in detail below.

No filter.

    If this data collection is default on, what is the opt-out mechanism for users?

Turn off Telemetry.

    Please provide a general description of how you will analyze this data.

The data will be analysed by the collection authors. The ping behaviour will be checked after about a week in the wild via bug 1463410

    Where do you intend to share the results of your analysis?

bug 1463410
Flags: needinfo?(francois)
Marshall, I want to make sure you're okay with this change because it does involve unique identifiers.

This is about splitting events out of the main telemetry ping.

It contains client_id, environment, and a new minute-resolution timestamp (the timestamp resolution in the main ping is currently hourly). It also links uniquely back to the main ping so we can tie the two together in our analyses.

The time resolution was increased in order to handle users who close their browser and reopen it within the same hour. We can already do this analysis on the main ping but in a very inefficient way.
Flags: needinfo?(francois) → needinfo?(merwin)
That seems ok to me.  This and the main ping will share the same ID?  Or is there a separate ID, in addition to the client_id, that this will collect?
Flags: needinfo?(merwin)
Flags: needinfo?(chutten)
(In reply to Merwin from comment #62)
> That seems ok to me.  This and the main ping will share the same ID?  Or is
> there a separate ID, in addition to the client_id, that this will collect?

It uses subsessionID to identify a unique "main" ping which resulted from the same subsession.
Flags: needinfo?(chutten)
Methodology change time:

Turns out it's easier on the pipeline if we do a hard switch from "main" ping events to "event" ping events, so here's the code adjusted somewhat to this new reality.
Comment on attachment 8984292 [details]
bug 1460595 - Remove events from main pings and, thus, TelemetrySession

https://reviewboard.mozilla.org/r/250100/#review256522

::: commit-message-1f0bb:3
(Diff revision 1)
> +bug 1460595 - Remove events from main pings and, thus, TelemetrySession r?Dexter
> +
> +This requires we take unsent event records out of about:telemetry since its

Hey Chris, this changeset is living in the same stack as the other (old?) changesets that were duplicating the event storage. 

- Is this intended?
- Do we still need the duplication effort given that we're removing this from the main ping?
- I don't think we need to adjust the schema for the main ping in the [schemas GitHub repo](https://github.com/mozilla-services/mozilla-pipeline-schemas), as `events` should be optional.
- Is there any other pipeline dependency we need to take care of before landing this? (but I guess you already checked this)
Comment on attachment 8984292 [details]
bug 1460595 - Remove events from main pings and, thus, TelemetrySession

https://reviewboard.mozilla.org/r/250100/#review256522

> Hey Chris, this changeset is living in the same stack as the other (old?) changesets that were duplicating the event storage. 
> 
> - Is this intended?
> - Do we still need the duplication effort given that we're removing this from the main ping?
> - I don't think we need to adjust the schema for the main ping in the [schemas GitHub repo](https://github.com/mozilla-services/mozilla-pipeline-schemas), as `events` should be optional.
> - Is there any other pipeline dependency we need to take care of before landing this? (but I guess you already checked this)

I edited the commits in place to pretend that it was -always- the plan to switch events. It occurs to me now that it might not have been the friendliest for reviewing this way... Sorry!
Comment on attachment 8984292 [details]
bug 1460595 - Remove events from main pings and, thus, TelemetrySession

https://reviewboard.mozilla.org/r/250100/#review256634

::: toolkit/components/telemetry/TelemetrySession.jsm:1233
(Diff revision 1)
>          // Parent histograms are added to the top-level payload object instead of the process payload.
>          if (processType == "parent" && (key == "histograms" || key == "keyedHistograms")) {
>            payloadLoc = payloadObj;
>          }
>          // The Dynamic process only collects events and scalars.
>          if (processType == "dynamic" && !["events", "scalars"].includes(key)) {

I think we can get rid of the `events` reference here too.

::: toolkit/components/telemetry/TelemetrySession.jsm:1238
(Diff revision 1)
>          if (processType == "dynamic" && !["events", "scalars"].includes(key)) {
>            continue;
>          }
>  
>          // Process measurements can be empty, set a default value.
>          let defaultValue = key == "events" ? [] : {};

Same here
Attachment #8984292 - Flags: review?(alessio.placitelli) → review+
See Also: → 1469321
Marshall, I assume you're okay with Chris' answer in comment 63?
Flags: needinfo?(merwin)
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2eeccf1db8a0
Document the event ping r=Dexter,janerik
https://hg.mozilla.org/integration/autoland/rev/f99bc6f4d002
Change storage semantics Telemetry Events r=Dexter
https://hg.mozilla.org/integration/autoland/rev/264376b1bd3f
Add Event Ping preferences to TelemetryUtils r=Dexter,janerik
https://hg.mozilla.org/integration/autoland/rev/f9ce3e37a291
Test "event" ping event storage r=Dexter,janerik
https://hg.mozilla.org/integration/autoland/rev/e2242d1730a3
Implement the 'event' ping r=Dexter,janerik
https://hg.mozilla.org/integration/autoland/rev/23936736c3b4
Add meta-telemetry to record how many event pings we're sending r=Dexter,janerik
https://hg.mozilla.org/integration/autoland/rev/6c93b28d0ab8
Test the 'event' ping r=Dexter
https://hg.mozilla.org/integration/autoland/rev/378ddda2ffc6
Remove events from main pings and, thus, TelemetrySession r=Dexter
Comment on attachment 8984292 [details]
bug 1460595 - Remove events from main pings and, thus, TelemetrySession

Approval Request Comment
[Feature/Bug causing the regression]:
No regression
[User impact if declined]:
Event Telemetry will be sent the slower way, on the overladen "main" pings, instead of quickly via purpose-built "event" pings.
[Is this code covered by automated tests?]:
Yes.
[Has the fix been verified in Nightly?]:
Yes (see bug 1463410)
[Needs manual test from QE? If yes, steps to reproduce]: 
No
[List of other uplifts needed for the feature/fix]:
This one, then bug 1463439 and bug 1470493 in either order.
[Is the change risky?]:
Yes and no.
[Why is the change risky/not risky?]:
Yes because it involves a few moving parts. No because it is under use on Nightly and appears to be working fine. Also No because it has a handy off switch (bug 1463439) in case things aren't as amazing as they seem.
[String changes made/needed]:
None.
Attachment #8984292 - Flags: approval-mozilla-beta?
Noticing this seems to still be waiting for confirmation from Marshall. I'll try emailing him.
This does seem a little risky but I'm OK with trying it out for next week's beta 7 build, I'd just like confirmation from Marshall since there were unanswered questions in https://bugzilla.mozilla.org/show_bug.cgi?id=1460595#c61.
Yes - Chris answered my question. This seems okay to me.
Flags: needinfo?(merwin)
Chris - a couple more questions. What problems might we look for - perf, overloading our telemetry infrastructure or something, memory issues? Are you going to be watching the data next week on beta? And, also, might this be a good candidate to plan for gradual rollout on release rather than preffing it on by default?
Flags: needinfo?(chutten)
Blocks: 1474295
Our Telemetry ingest has alerts looking at the size and volume of incoming pings. The pipeline will alert us if anything isn't "normal". At this point we've reached saturation on Nightly and the volume[1] and size[2] of incoming pings look good (no suspicious increases or decreases).

As for overloading Telemetry's infrastructure, it's possible but I assign that a low risk. We have weathered some rather strange ping behaviours in the past (some that sent every minute of every day, on a loop) and haven't lost data because of them.

Perf-wise, the most common operation, recording events, is nigh-identical to before and should be perf-neutral. The less-frequent operations should be once an hour (usually just once a day, according to bug 1463410) and fairly lightweight (gather the events, send the ping). I would not expect a perf impact.

Memory is a larger risk as the number of recorded events could grow larger than the hard cap we had before. However there are two mechanisms keeping tabs on that (the 1000-event-records-at-a-time drain and the truncate when it happens more than twice in twenty minutes), and if anything goes wrong we have toolkit.telemetry.eventping.enabled as an emergency halt while we figure things out.

Gradual rollout is incompatible with how the patches were written. Using the new way of sending event records required changing and removing the old code for sending event records. The enabled pref is an emergency stop that halts all events being recorded. It doesn't enable the old way of sending it by "main" ping. So if we gradually rolled it out, there would be a population that wasn't sending any events at all. This is my fault for not considering that gradual rollout would be desirable.

I was planning on monitoring the "event" ping volumes and contents as beta picked it up on an ad hoc basis, but I'm heading out on PTO starting Wednesday so I'll be unable to. Luckily, :Dexter (who reviewed the analysis and most of these patches) has offered to take this up, so I've filed bug 1474295 to formalize it.

Would a chat on IRC or over vidyo be beneficial to help answer any other questions?

[1]: https://pipeline-cep.prod.mozaws.net/dashboard_output/graphs/analysis.moz_telemetry_doctype_monitor_event.volume.html
[2]: https://pipeline-cep.prod.mozaws.net/dashboard_output/graphs/analysis.moz_telemetry_doctype_monitor_event.size.html
No longer blocks: 1474295
Flags: needinfo?(chutten) → needinfo?(lhenry)
Blocks: 1474295
Comment on attachment 8984292 [details]
bug 1460595 - Remove events from main pings and, thus, TelemetrySession

Let's go ahead, this will make it into beta 7 or 8 depending on how the merge goes today.
Flags: needinfo?(lhenry)
Attachment #8984292 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Hey Liz,

did this end up in beta 7 or 8? Was this released? If so, when? Sorry for the burst of questions :)
Flags: needinfo?(lhenry)
Hi Alessio, No problem!  Yes, I think this is the last set of changes that landed for beta 7, which was released on Tuesday.  You can see the patches that landed between beta 6 and 7 like this, very handy if a little obscure (I do this by just changing the versions in the URL)  https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_62_0b6_RELEASE&tochange=FIREFOX_62_0b7_RELEASE
Flags: needinfo?(lhenry)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #105)
> Hi Alessio, No problem!  Yes, I think this is the last set of changes that
> landed for beta 7, which was released on Tuesday.  You can see the patches
> that landed between beta 6 and 7 like this, very handy if a little obscure
> (I do this by just changing the versions in the URL) 
> https://hg.mozilla.org/releases/mozilla-beta/
> pushloghtml?fromchange=FIREFOX_62_0b6_RELEASE&tochange=FIREFOX_62_0b7_RELEASE

Thanks (sorry for the delay) :)
How is this going? I had a note to follow up with you this week. Any regressions or problems?
Flags: needinfo?(alessio.placitelli)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #107)
> How is this going? I had a note to follow up with you this week. Any
> regressions or problems?

Hey, what a timing! I just closed bug 1474295. This looks ok on Beta, we identified a potential problem which is not a blocker for Beta.
Flags: needinfo?(alessio.placitelli)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: