Closed Bug 1156712 Opened 5 years ago Closed 4 years ago

Re-design ping sending logic

Categories

(Toolkit :: Telemetry, defect)


Tracking


RESOLVED FIXED
mozilla42
Tracking Status
firefox40 --- wontfix
firefox41 --- fixed
firefox42 --- fixed

People

(Reporter: gfritzsche, Assigned: gfritzsche)

References

Details

(Whiteboard: [b5] [unifiedTelemetry] [uplift2])

Attachments

(4 files, 14 obsolete files)

2.97 KB, patch
vladan
: review+
Details | Diff | Splinter Review
49.33 KB, patch
Dexter
: review+
Details | Diff | Splinter Review
52.23 KB, patch
gfritzsche
: review+
Details | Diff | Splinter Review
29.95 KB, patch
gfritzsche
: review+
Dexter
: feedback+
Details | Diff | Splinter Review
In the context of bug 1155604 and others, we need:
* smarter ping-expiry and ping-loading rules
* prioritizing ping sending
* better disk quota management
* better network bandwidth management
* an updated ping send strategy

In the context of this we should also specify the current behavior.
This is probably best done on top of the refactorings from bug 1156359 & bug 1156355.
Depends on: 1156355, 1156359
Benjamin & Brendan, can you please review the high-level points from comment 1 and mention any concerns?
Vladan is on PTO next week, so it would be nice to clear this up by tomorrow.
Flags: needinfo?(benjamin)
Flags: needinfo?(bcolloran)
Boy, I don't really know what to say about a lot of this stuff... I guess in response to those high-level points, I will zoom out to even higher thoughts about what our needs are from FHR-- if this is really an edge case that comes up in only a very small minority of pathological cases where a client is generating hundreds of subsessions per day, then it's probably fine no matter what we do-- whether we preferentially upload new payloads or old payloads, whether we delete old data a week at a time or a month at a time, etc-- it should really not matter that much, because it should happen *rarely*. And so if I think about it from that perspective, I don't have strong opinions about these implementation decisions, which I trust you guys to make.

But I guess what I'm afraid of is that this situation is actually pretty common because something is happening with our rules for generating subsessions or something that is causing us to frequently generate too many pings (more than we can send back to our servers). In that case, these implementation details are also not that important to me, because if we are failing to get a large number of subsession pings back from a large number of clients, then indeed it doesn't matter whether we're uploading oldest or newest first, deleting old pings in batches of weeks or months, etc, because no matter what we'll be creating a backlog faster than we can dig ourselves out, and there will be data loss... in that case, we may need to try things like pulling out the parts of the data that are needed for FHR, or even more extreme architectural changes.

Hopefully we're not in the latter situation, but in order to decide I guess we'll need data. So, whatever ping expiry scheme we adopt, we need to build monitoring for it first to make sure we're not in the situation described above. I would like to know, for example:
* each time a ping is sent, can we report the number of pings in the backlog?
* if we decide to send old pings newest-first, can pings report when they have been sent out of order?
* can we report each time we have to delete old pings?

Since this is critical data integrity stuff, it might be good to monitor it in the Heka layer. And of course, the monitoring mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1129185 can also shed light on subsession chain integrity, so we should make that a priority.
Flags: needinfo?(bcolloran)
I reviewed the doc. I think we're good here.
Flags: needinfo?(benjamin)
(In reply to brendan c from comment #4)
> But I guess what I'm afraid of is that this situation is actually pretty
> common because something is happening with our rules for generating
> subsessions or something that is causing us to frequently generate too many
> pings (more than we can send back to our servers).

We need to implement something soon, so is it ok with you to go with this approach as long as we have the metrics to answer the remaining questions?

> Hopefully we're not in the latter situation, but in order to decide I guess
> we'll need data. So, whatever ping expiry scheme we adopt, we need to build
> monitoring for it first to make sure we're not in the situation described
> above. I would like to know, for example:
> * each time a ping is sent, can we report the number of pings in the backlog?

We can add the current count of backlogged pings to the subsession data. That should be enough, right?

> * if we decide to send old pings newest-first, can pings report when they
> have been sent out of order?

Can you expand on what you mean here?

> * can we report each time we have to delete old pings?

We already have that (count of pending pings that we cleared out): http://telemetry.mozilla.org/#filter=nightly%2F40%2FTELEMETRY_FILES_EVICTED%2Fsaved_session%2FFirefox&aggregates=Submissions&evoOver=Time&locked=true&sanitize=true&renderhistogram=Graph

> Since this is critical data integrity stuff, it might be good to monitor it
> in the Heka layer. And of course, the monitoring mentioned in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1129185 can also shed light on
> subsession chain integrity, so we should make that a priority.

Should we start collecting the desired monitors in an etherpad or google doc?
(In reply to Georg Fritzsche [:gfritzsche] from comment #6)
> We need to implement something soon, so is it ok with you to go with this
> approach as long as we have the metrics to answer the remaining questions?

Yes, sorry-- that is what I was trying to say: let's go ahead with this strategy, but let's collect data on possible errors.


> > * each time a ping is sent, can we report the number of pings in the backlog?
> 
> We can add the current count of backlogged pings to the subsession data.
> That should be enough, right?

Yep, perfect.

> > * if we decide to send old pings newest-first, can pings report when they
> > have been sent out of order?
> 
> Can you expand on what you mean here?
> 
> > * can we report each time we have to delete old pings?
> 
> We already have that (count of pending pings that we cleared out):
> http://telemetry.mozilla.org/
> #filter=nightly%2F40%2FTELEMETRY_FILES_EVICTED%2Fsaved_session%2FFirefox&aggr
> egates=Submissions&evoOver=Time&locked=true&sanitize=true&renderhistogram=Gra
> ph

Great.

> Should we start collecting the desired monitors in an etherpad or google doc?

Sure, or maybe in a bug or bugs? As mentioned, there is already this bug:
https://bugzilla.mozilla.org/show_bug.cgi?id=1129185
I don't think I am empowered to prioritize that bug, but as I have said, I would recommend doing so.
Blocks: 1154113
Assignee: nobody → gfritzsche
Status: NEW → ASSIGNED
The expiry logic is taken care of by bug 1162538 & bug 1168835.
Summary: Re-design ping sending & expiry logic → Re-design ping sending logic
(In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> The expiry logic is taken care of by bug 1162538 & bug 1168835.

Actually, bug 1168835 takes care of part of it (enforcing quota etc.), but we still need to deal with part of the pending ping expiry logic here.
The way we were collecting outgoing pings in tests was pretty hard to follow and update. I got tired of that and replaced it with a clearer approach.
Attachment #8617878 - Flags: review?(vdjeric)
This redesigns the ping send logic. The retrying on connectivity changes is coming up in a separate patch; the pending ping expiry update & quota management are covered by bug 1168835.
Attachment #8617880 - Flags: review?(vdjeric)
These are pretty big, so I'm going to review them in several passes.
Are you planning to uplift these patches to Aurora 40?
Sorry, I didn't get to this today, I'll try again tomorrow :/
Comment on attachment 8617880 [details] [diff] [review]
Part 2: Re-design the ping sending logic

Review of attachment 8617880 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySend.jsm
@@ +61,5 @@
>  
>  // Timeout after which we consider a ping submission failed.
> +const PING_SUBMIT_TIMEOUT_MS = 1.5 * 60 * 1000;
> +
> +const MS_IN_A_MINUTE = 60 * 1000;

just a suggestion, you could re-use this var when declaring PING_SUBMIT_TIMEOUT_MS and the other constants that have minutes as a unit
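
For illustration, the suggestion would look roughly like this (a sketch; the values come from the constants shown in the hunks here):

```javascript
// Sketch of the suggestion: derive the minute-based constants from
// MS_IN_A_MINUTE instead of repeating 60 * 1000.
const MS_IN_A_MINUTE = 60 * 1000;
// Timeout after which we consider a ping submission failed.
const PING_SUBMIT_TIMEOUT_MS = 1.5 * MS_IN_A_MINUTE;
// Delay before the next send attempt when more pings are pending.
const SEND_TICK_DELAY = 1 * MS_IN_A_MINUTE;
```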

@@ +68,5 @@
> +// of pings per minute.
> +const MAX_PING_SENDS_PER_MINUTE = 10;
> +
> +// If we have more pending pings then we can send right now, we schedule the next
> +// send for after SUBMIT_TICK_DELAY.

don't you mean "after SEND_TICK_DELAY"

@@ +72,5 @@
> +// send for after SUBMIT_TICK_DELAY.
> +const SEND_TICK_DELAY = 1 * MS_IN_A_MINUTE;
> +// If we had any ping send failures since the last ping, we use a backoff timeout
> +// for the next ping sends. We increase the delay exponentially up to a limit of
> +// SUBMIT_MAXIMUM_BACKOFF_DELAY_MS.

ditto on the const's name

@@ +94,5 @@
>   */
>  let Policy = {
>    now: () => new Date(),
>    midnightPingFuzzingDelay: () => MIDNIGHT_FUZZING_DELAY_MS,
> +  setSchedulerTickTimeout: (callback, delayMs) => setTimeout(callback, delayMs),

nit: these names seem verbose, how about scheduleNextSend and clearSendTimer

@@ +264,5 @@
> +    this._lastPingSendTimes = [];
> +  },
> +
> +  /**
> +   * Notify the scheduler of a ping being send out. It needs to keep track

typo: sent :)

@@ +322,5 @@
> +      this._schedulerTimer = null;
> +    }
> +  },
> +
> +  _onSchedulerTick: function() {

nit about naming: this module doesn't really "tick"

@@ +340,5 @@
> +    }
> +
> +    // Check if we can actually send any data. If not we don't need to try sending
> +    // or schedule future sends.
> +    if (!TelemetrySendImpl.canSend()) {

put this at the top of the function?

@@ +348,5 @@
> +
> +    // Get a list of pending pings, sorted by last modified, descending.
> +    // We filter the list for the pings that we are currently trying to send.
> +    let pending = TelemetryStorage.getPendingPingList()
> +                    .filter(p => !TelemetrySendImpl.pendingPingRequests.has(p.id));

don't we want to block on server response to a ping submission? maybe with a timeout

@@ +361,5 @@
> +
> +    // Trigger sending as many pings as we are allowed to now.
> +    const sendCount = Math.max(0, MAX_PING_SENDS_PER_MINUTE - this._lastPingSendTimes.length);
> +    let promise = Promise.resolve();
> +    this._log.trace("_onSchedulerTick - sent " + this._lastPingSendTimes.length + " in last minute: " +

don't need the colon

@@ +374,5 @@
> +
> +    if (sendCount > 0) {
> +      this._log.trace("_onSchedulerTick - triggering sending " + sendCount + " pings now");
> +      let sending = pending.slice(0, sendCount);
> +      promise = TelemetrySendImpl.sendPersistedPings([for (p of sending) p.id]);

would they all be persisted pings? what about pings generated during this session?

@@ +583,5 @@
>      if (!this._sendingEnabled || throttled) {
>        // Sending is disabled or throttled, add this to the pending pings.
>        this._log.trace("submitPing - ping is pending, sendingEnabled: " + this._sendingEnabled +
>                        ", throttled: " + throttled);
>        return TelemetryStorage.savePendingPing(ping);

what happens if I opt-out of base and/or extended Telemetry and later I opt-in? Do my old sessions get sent now?

@@ +640,4 @@
>        return Promise.resolve();
>      }
>  
> +    if (!this.canSend()) {

put this at the top?

@@ +653,4 @@
>        pingSendPromises.push(
> +        TelemetryStorage.loadPendingPing(id)
> +          .then((data) => this._doPing(data, id, true)
> +          .catch(e => this._log.error("sendPersistedPings - _doPing rejected", e))));

would we not want to wait on a response before making requests?
Attachment #8617880 - Flags: review?(vdjeric)
Comment on attachment 8617878 [details] [diff] [review]
Part 1: Clean up test utils for waiting on ping submissions

Review of attachment 8617878 [details] [diff] [review]:
-----------------------------------------------------------------

I'll wait on reviewing test code until we have a chance to discuss on IRC
Attachment #8617878 - Flags: review?(vdjeric)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #14)
> @@ +348,5 @@
> > +
> > +    // Get a list of pending pings, sorted by last modified, descending.
> > +    // We filter the list for the pings that we are currently trying to send.
> > +    let pending = TelemetryStorage.getPendingPingList()
> > +                    .filter(p => !TelemetrySendImpl.pendingPingRequests.has(p.id));
> 
> don't we want to block on server response to a ping submission? maybe with a
> timeout

Hm, what for exactly?

> @@ +374,5 @@
> > +
> > +    if (sendCount > 0) {
> > +      this._log.trace("_onSchedulerTick - triggering sending " + sendCount + " pings now");
> > +      let sending = pending.slice(0, sendCount);
> > +      promise = TelemetrySendImpl.sendPersistedPings([for (p of sending) p.id]);
> 
> would they all be persisted pings? what about pings generated during this
> session?

Yes, they would all be persisted.
I changed behavior here: we still try to send pings right away.
If we can't send right now, we persist the ping and send it on the next chance (sending pending newest-first).
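
The "sending pending newest-first" part could be sketched like this (illustrative only; the function and the lastModified field name are assumptions, not the patch's API):

```javascript
// Pick up to maxCount pending pings, newest (by last-modified time) first.
function selectNewestFirst(pendingPings, maxCount) {
  return pendingPings
    .slice() // don't mutate the caller's list
    .sort((a, b) => b.lastModified - a.lastModified)
    .slice(0, maxCount);
}
```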

> @@ +583,5 @@
> >      if (!this._sendingEnabled || throttled) {
> >        // Sending is disabled or throttled, add this to the pending pings.
> >        this._log.trace("submitPing - ping is pending, sendingEnabled: " + this._sendingEnabled +
> >                        ", throttled: " + throttled);
> >        return TelemetryStorage.savePendingPing(ping);
> 
> what happens if I opt-out of base and/or extended Telemetry and later I
> opt-in? Do my old sessions get sent now?

Yes, good point. That is an oversight from bug 1120379; I filed bug 1174674 on this (we should clear out existing pending pings there).

> @@ +653,4 @@
> >        pingSendPromises.push(
> > +        TelemetryStorage.loadPendingPing(id)
> > +          .then((data) => this._doPing(data, id, true)
> > +          .catch(e => this._log.error("sendPersistedPings - _doPing rejected", e))));
> 
> would we not want to wait on a response before making requests?

I don't understand, do you mean we should switch to sending pings sequentially? We never did that before.
This addresses some issues uncovered by testing and changes the send logic to operate in batches. We send a batch of pings (subject to the 10-pings-per-minute limit) and wait until they have all succeeded or failed. If any failed, we start the backoff behavior; otherwise we try again right away. New ping submissions and idle-daily triggers reset the backoff behavior.
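
The backoff part of that batching behavior amounts to something like the following sketch (the SEND_MAXIMUM_BACKOFF_DELAY_MS value and the nextSendDelay helper are assumptions for illustration, not the patch's code):

```javascript
const SEND_TICK_DELAY = 60 * 1000; // base delay between batches
// Cap for the exponential backoff; the concrete value here is an assumption.
const SEND_MAXIMUM_BACKOFF_DELAY_MS = 120 * 60 * 1000;

// After a batch completes: reset to the base delay if all sends succeeded,
// otherwise double the delay, up to the cap.
function nextSendDelay(currentDelay, sendsFailed) {
  if (!sendsFailed) {
    return SEND_TICK_DELAY;
  }
  return Math.min(SEND_MAXIMUM_BACKOFF_DELAY_MS, currentDelay * 2);
}
```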
Attachment #8622561 - Flags: review?(vdjeric)
Attachment #8617880 - Attachment is obsolete: true
This extends the test coverage by introducing a new test focused on TelemetrySend.
Attachment #8622562 - Flags: review?(vdjeric)
This updates the documentation for the changes here.
Attachment #8622564 - Flags: review?(vdjeric)
Attachment #8617878 - Flags: review?(vdjeric)
(In reply to brendan c from comment #7)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #6)
> > We need to implement something soon, so is it ok with you to go with this
> > approach as long as we have the metrics to answer the remaining questions?
> 
> Yes, sorry-- that is what I was trying to say: let's go ahead with this
> strategy, but let's collect data on possible errors.
> 
> 
> > > * each time a ping is sent, can we report the number of pings in the backlog?
> > 
> > We can add that the current count of backlogged pings the subsession data.
> > That should be enough, right?
> 
> Yep, perfect.

Per the patches here, this is: ping.payload.info.simpleMeasurements.savedPings
Attachment #8622564 - Flags: review?(vdjeric) → review+
Comment on attachment 8622561 [details] [diff] [review]
Part 2: Re-design the ping sending logic

Review of attachment 8622561 [details] [diff] [review]:
-----------------------------------------------------------------

I'm finding it a bit hard to verify the properties of this code, there's a lot of state and the code itself is starting to show signs of Promise spaghetti again :(

Maybe this review will go faster if I ask you about the intended behaviour first:

1. When the sending of an ad-hoc or persisted ping fails for whatever reason, when does the send get retried?
1.1 When a ping send fails for whatever reason, are there safeties in place that prevent the submission code from spamming with retries?
2. What happens when I submit two new (ad-hoc) pings, one right after the other? And if one or both of the sends fail?
3. What happens when a ping is transmitted (but not acknowledged) right before shutdown begins?
4. When is a persisted ping actually deleted from disk after sending?
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #21)
> I'm finding it a bit hard to verify the properties of this code, there's a
> lot of state and the code itself is starting to show signs of Promise
> spaghetti again :(

FWIW, i think the situation didn't change much here. The only real change is that all sending goes through SendScheduler.triggerSendingPings(), which decides when we can send and implements retry/backoff/... behavior.
 
> Maybe this review will go faster if I ask you about the intended behaviour
> first:
> 
> 1. When the sending of an ad-hoc or persisted ping fails for whatever
> reason, when does the send get retried?

We always send pings in batches. Once a batch completed sending, we check whether we had any failures.
If we had failures (triggerSendingPings() checking on _sendsFailed), we schedule a wait per backoff-timer-behavior.

> 1.1 When a ping send fails for whatever reason, are there safeties in place
> that prevent the submission code from spamming with retries?

We wait on the batch completing and check for send failures, as mentioned above.
A possible spamming scenario is external code continuously submitting new pings as this overrides the backoff behavior (which is the safety mechanism you requested).

> 2. What happens when I submit two new (ad-hoc) pings, one right after the
> other? And if one or both of the sends fail?

Per your request, both bypass existing mechanisms and trigger new sends - the new ping always gets sent, and we also trigger persisted ping sends up to the max-ping-sends-per-minute limit.

> 3. What happens when a ping is transmitted (but not acknowledged) right
> before shutdown begins?

(that behavior didn't change)
We cancel all outgoing requests (_cancelOutgoingRequests()) to avoid blocking/waiting on them on shutdown.

> 4. When is a persisted ping actually deleted from disk after sending?

(that behavior didn't change)
A persisted ping is deleted if we got a 2XX or 4XX response from the server, on 5XX we trigger retrying - see here:
https://ci.mozilla.org/job/mozilla-central-docs/Tree_Documentation/toolkit/components/telemetry/telemetry/pings.html#submission
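
In other words, the deletion rule described above boils down to this (the function name is made up for illustration):

```javascript
// Decide whether a persisted ping can be removed after a server response:
// 2XX means accepted and 4XX means permanently rejected -- delete in both
// cases. 5XX is a server error -- keep the ping on disk and retry later.
function shouldDiscardPersistedPing(status) {
  return (status >= 200 && status < 300) || (status >= 400 && status < 500);
}
```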
OS: Unspecified → All
Hardware: Unspecified → All
Whiteboard: [uplift2]
We decided to make the send logic more linear by isolating it into a separate task with this rough logic:
https://etherpad.mozilla.org/BE4egGB4Eb
Whiteboard: [uplift2] → [b5] [unifiedTelemetry] [uplift2]
Attachment #8617878 - Flags: review?(vdjeric)
Attachment #8622561 - Flags: review?(vdjeric)
Attachment #8622562 - Flags: review?(vdjeric)
Attachment #8617878 - Attachment is obsolete: true
Attachment #8624528 - Flags: review?(vdjeric)
Attachment #8622561 - Attachment is obsolete: true
Attachment #8624529 - Flags: review?(vdjeric)
Attachment #8622562 - Attachment is obsolete: true
Attachment #8624527 - Flags: review?(vdjeric) → review?(alessio.placitelli)
Part 1 is only test cleanups; moving this over to Alessio to lighten Vladan's queue.
Comment on attachment 8624527 [details] [diff] [review]
Part 1: Clean up test utils for waiting on ping submissions

Review of attachment 8624527 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good. I'd just suggest to add that comment for registerPingHandler, but it's optional.

Are you planning on updating test_TelemetryController_idle.js and test_TelemetryEnvironment.js as well? That would be great, but again, not urgent.

::: toolkit/components/telemetry/tests/unit/head.js
@@ +38,5 @@
> +  get started() {
> +    return this._started;
> +  },
> +
> +  registerPingHandler: function(handler) {

I think this could use a little comment to describe the format of handler.

::: toolkit/components/telemetry/tests/unit/test_TelemetryController.js
@@ +299,5 @@
>    // with the timer callback.
>    now = futureDate(now, pingSendTimeout);
>    fakeNow(now);
>    yield pingSendTimerCallback();
> +  const pings = yield PingServer.promiseNextPings(2);

Finally, this looks much better than the old one :)
Attachment #8624527 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8624528 [details] [diff] [review]
Part 2: Re-design the ping sending logic

Review of attachment 8624528 [details] [diff] [review]:
-----------------------------------------------------------------

I didn't review all the paths in TelemetrySend and TelemetryStorage yet, but I think we should discuss the design a bit more in person at Whistler. We can have our own hour-long review/hack session for TelemetrySend :)

::: toolkit/components/telemetry/TelemetrySend.jsm
@@ +295,5 @@
> +  // The current retry delay after ping send failures. We use this for the exponential backoff,
> +  // increasing this value everytime we had send failures since the last tick.
> +  _currentTickTimeout: SEND_TICK_DELAY,
> +  // This is a sliding window of the last N ping send timestamps.
> +  // We need to track this to uphold a limit on how many pings we send per minute.

i'm wondering if the sliding window is a bit too precise? We only really need to throttle on startup. We're expecting a potential flurry of sends on startup and then loooong periods of idleness interrupted by the occasional environment-changed ping.

So instead of using a sliding window, we could track the time when the last batch was sent and just limit sending to 1 batch per minute. 

Then:

If now < lastBatch + 1 minute:
  try again in a minute
else:
  lastBatch = now
  send up to 10 pings
  try again in a minute if there are more than 10 pings

idle-daily and ad-hoc pings would behave like that last else branch.

Since it's not as precise as a sliding window, it won't enforce a max of exactly 10 pings/minute at all times, but that's not a hard requirement and the code might be simpler.
What do you think?
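
That pseudocode could translate to something like this (a sketch with assumed names; not the patch's code):

```javascript
const MS_IN_A_MINUTE = 60 * 1000;
const MAX_BATCH_SIZE = 10; // max pings per batch, per the suggestion above

// Decide what to do on a send opportunity, given the time of the last batch
// and how many pings are pending. Returns how many pings to send now and
// when to check again (retryInMs of 0 means no retry is needed).
function planBatch(now, lastBatchTime, pendingCount) {
  if (now < lastBatchTime + MS_IN_A_MINUTE) {
    // Too soon after the last batch: send nothing, try again in a minute.
    return { sendCount: 0, lastBatchTime, retryInMs: MS_IN_A_MINUTE };
  }
  const sendCount = Math.min(MAX_BATCH_SIZE, pendingCount);
  // Only reschedule if this batch doesn't drain the backlog.
  const retryInMs = pendingCount > MAX_BATCH_SIZE ? MS_IN_A_MINUTE : 0;
  return { sendCount, lastBatchTime: now, retryInMs };
}
```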

@@ +374,5 @@
> +
> +    return this._sendTask;
> +  },
> +
> +  _sendTaskLogic: Task.async(function*() {

I wouldn't expect sendTaskLogic to be a task

@@ +378,5 @@
> +  _sendTaskLogic: Task.async(function*() {
> +    this._currentTickTimeout = SEND_TICK_DELAY;
> +    this._sendsFailed = false;
> +
> +    for (;;) {

can't this be a while loop conditioned on non-zero outstanding pings (pending + current + in-progress)?

@@ +418,5 @@
> +      // If we have no outgoing pings we hit the max-pings-per-minute limit, wait for a bit.
> +      if (!haveOutgoingPings) {
> +        this._log.trace("_sendTaskLogic - hit ping-per-min limit, delaying ping send to " +
> +                        new Date(now.getTime() + SEND_TICK_DELAY));
> +        yield CancellableTimeout.promiseWaitOnTimeout(SEND_TICK_DELAY);

could we fake a sleep(1000) here? :)

@@ +449,5 @@
> +          continue;
> +        }
> +
> +        const newTimeout = Math.min(SEND_MAXIMUM_BACKOFF_DELAY_MS,
> +                                this._currentTickTimeout * 2);

nit: alignment

@@ +557,3 @@
>      // Check the pending pings on disk now.
> +    yield this._checkPendingPings();
> +    SendScheduler.triggerSendingPings(true);

we won't send any pending pings until we're done clearing out old pings? I think sending of pending pings is higher priority than clearing of expired pending pings

@@ +560,4 @@
>    }),
>  
>    /**
>     * Discard old pings from the pending pings and detect overdue ones.

I thought ping expiry (and ping maintenance in general) were going to be done passively in TelemetryStorage? It seems like it would result in a nicer separation & layering of responsibilities. What do you think?

@@ +670,5 @@
> +    // As a safety mechanism, this resets any currently active throttling.
> +    this._log.trace("submitPing - can send pings, trying to send now");
> +    this._currentPings.set(ping.id, ping);
> +    SendScheduler.triggerSendingPings(true);
> +    return this.promisePendingPingActivity();

this isn't pretty.. 
do callers need a promise? can't submitPing be fire and forget? it's not like the caller can do anything about their ping not going out
if we do need a promise, can't we have one promise for each queued current ping?

@@ +924,3 @@
>     * Check if pings can be sent to the server. If FHR is not allowed to upload,
>     * pings are not sent to the server (Telemetry is a sub-feature of FHR). If trying
>     * to send a deletion ping, don't block it.

i wonder if we should treat the deletion ping differently.. i.e. have it not be a real Telemetry ping and just have it be an XHR sent from about:preferences or something.

it would be a bit weird to send information about your system configuration together with a message telling Mozilla to stop collecting Telemetry in your browser

What happens if I have Telemetry enabled, then I send a deletion ping and keep my Firefox session running. Would I continue submitting pings for the remainder of the session?

@@ +925,5 @@
>     * pings are not sent to the server (Telemetry is a sub-feature of FHR). If trying
>     * to send a deletion ping, don't block it.
>     * If unified telemetry is off, don't send pings if Telemetry is disabled.
>     *
>     * @param {Object} [ping=null] A ping to be checked.

hmm, it's a bit odd that Telemetry's canSend depends on the ping itself

@@ +933,2 @@
>      // We only send pings from official builds, but allow overriding this for tests.
>      if (!Telemetry.isOfficialTelemetry && !this._testMode) {

hmm, why is this not a startup-time check?
i'm not against checking twice just in case, but I figured build-time variables would be checked at Telemetry initialization time (i.e. TelemetrySend would not even be launched)

@@ +936,5 @@
>      }
>  
>      // With unified Telemetry, the FHR upload setting controls whether we can send pings.
>      // The Telemetry pref enables sending extended data sets instead.
>      if (IS_UNIFIED_TELEMETRY) {

Wouldn't it be better to keep the canSend logic in TelemetryController next to the logic for canRecord?
Seems like treating the deletion ping as just another ping, and then special-casing its sending in TelemetrySend.jsm just makes things more complicated

@@ +965,5 @@
>     *         are resolved.
>     */
>    promisePendingPingActivity: function () {
>      this._log.trace("promisePendingPingActivity - Waiting for ping task");
>      return Promise.all([for (p of this._pendingPingActivity) p.catch(ex => {

this method is nasty ;) if a caller submits a ping for sending, why should the promise remain unfulfilled until all pings in the current buffer are sent or fail sending?

::: toolkit/components/telemetry/TelemetryStorage.jsm
@@ +1036,5 @@
>      return this._scanPendingPingsTask;
>    },
>  
>    getPendingPingList: function() {
>      return this._buildPingList();

The function names are a little confusing: getPendingPingList, _buildPingList, _scanPendingPings, _checkPendingPings, loadPendingPingList. I found it hard to infer from the names how these functions differ and how they relate to each other.

I think this is it: _checkPendingPings -> loadPendingPingList -> _scanPendingPings -> _buildPingList and
getPendingPingList -> _buildPingList

We could add documentation for these functions, but I think what we really need are better function names.

can you also add a comment above the functions that return a promise and explain what the promise represents?

And let's try to minimize the amount of promise-using code :)

@@ +1039,5 @@
>    getPendingPingList: function() {
>      return this._buildPingList();
>    },
>  
>    _scanPendingPings: Task.async(function*() {

Does the saved-telemetry-pings directory still get re-created if it's missing?
Attachment #8624528 - Flags: review?(vdjeric) → review-
Attachment #8624529 - Flags: review?(vdjeric)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #29)
> I didn't review all the paths in TelemetrySend and TelemetryStorage yet, but
> I think we should discuss the design a bit more in person at Whistler. We
> can have our own hour-long review/hack session for TelemetrySend :)

This is blocking the unified telemetry schedule, so I was hoping we can keep this more actionable & oriented towards finishing it up.
We will need some confirmation time on Nightly and Beta 5 (July 16) is the cutoff. We shouldn't uplift this at the last minute either.

> ::: toolkit/components/telemetry/TelemetrySend.jsm
> @@ +295,5 @@
> > +  // The current retry delay after ping send failures. We use this for the exponential backoff,
> > +  // increasing this value everytime we had send failures since the last tick.
> > +  _currentTickTimeout: SEND_TICK_DELAY,
> > +  // This is a sliding window of the last N ping send timestamps.
> > +  // We need to track this to uphold a limit on how many pings we send per minute.
> 
> i'm wondering if the sliding window is a bit too precise? We only really
> need to throttle on startup. We're expecting a potential flurry of sends on
> startup and then loooong periods of idleness interrupted by the occasional
> environment-changed ping.
> 
> So instead of using a sliding window, we could track the time when the last
> batch was sent and just limit sending to 1 batch per minute. 
> 
> Then:
> 
> If now < lastBatch + 1 minute:
>   try again in a minute
> else:
>   lastBatch = now
>   send up to 10 pings
>   try again in a minute if there are more than 10 pings
> 
> idle-daily and ad-hoc pings would behave like that last else branch.
> 
> Since it's not as precise as a sliding window, it won't enforce a max of
> exactly 10 pings/minute at all times, but that's not a hard requirement and
> the code might be simpler.
> What do you think?

This was mostly implementing your suggestions - it seems fine with me in principle; the only concern would be external ping submissions spamming here.
Is that ok with you?

> @@ +378,5 @@
> > +  _sendTaskLogic: Task.async(function*() {
> > +    this._currentTickTimeout = SEND_TICK_DELAY;
> > +    this._sendsFailed = false;
> > +
> > +    for (;;) {
> 
> can't this be a while loop conditioned on non-zero outstanding pings
> (pending + current + in-progress)?

That would mean tracking data in some local variable across iterations; keeping no local state across iterations seems much cleaner.

> @@ +418,5 @@
> > +      // If we have no outgoing pings we hit the max-pings-per-minute limit, wait for a bit.
> > +      if (!haveOutgoingPings) {
> > +        this._log.trace("_sendTaskLogic - hit ping-per-min limit, delaying ping send to " +
> > +                        new Date(now.getTime() + SEND_TICK_DELAY));
> > +        yield CancellableTimeout.promiseWaitOnTimeout(SEND_TICK_DELAY);
> 
> could we fake a sleep(1000) here? :)

I don't understand, that's basically what it's doing? Keep in mind that this needs to be cancellable from the outside, so (1) we won't block on long waits on shutdown here and (2) we can cancel this on new ping submissions.

> @@ +557,3 @@
> >      // Check the pending pings on disk now.
> > +    yield this._checkPendingPings();
> > +    SendScheduler.triggerSendingPings(true);
> 
> we won't send any pending pings until we're done clearing out old pings? I
> think sending of pending pings is higher priority than clearing of expired
> pending pings

This makes things much easier to reason about and test (races abound...) - is that a big problem?
E.g. clearing out old pings would race against sending them out.

> @@ +560,4 @@
> >    }),
> >  
> >    /**
> >     * Discard old pings from the pending pings and detect overdue ones.
> 
> I thought ping expiry (and ping maintenance in general) were going to be
> done passively in TelemetryStorage? It seems like it would result in a nicer
> separation & layering of responsibilities. What do you think?

I separated the overhaul of this part out into bug 1168835, which is blocked by this bug.

> @@ +670,5 @@
> > +    // As a safety mechanism, this resets any currently active throttling.
> > +    this._log.trace("submitPing - can send pings, trying to send now");
> > +    this._currentPings.set(ping.id, ping);
> > +    SendScheduler.triggerSendingPings(true);
> > +    return this.promisePendingPingActivity();
> 
> this isn't pretty.. 
> do callers need a promise? can't submitPing be fire and forget? it's not
> like the caller can do anything about their ping not going out
> if we do need a promise, can't we have one promise for each queued current
> ping?

I don't like it too much either, but I needed to adjust the tests without rewriting everything.
Reworking this completely won't be quick.

> @@ +924,3 @@
> >     * Check if pings can be sent to the server. If FHR is not allowed to upload,
> >     * pings are not sent to the server (Telemetry is a sub-feature of FHR). If trying
> >     * to send a deletion ping, don't block it.
> 
> i wonder if we should treat the deletion ping differently.. i.e. have it not
> be a real Telemetry ping and just have it be an XHR sent from
> about:preferences or something.

Consider the server being temporarily unreachable: we would end up duplicating all the logic for persistence, retries, etc.

> it would be a bit weird to send information about your system configuration
> together with a message telling Mozilla to stop collecting Telemetry in your
> browser

What do you mean by that? If you disable upload, we should only send out this ping.

> What happens if I have Telemetry enabled, then I send a deletion ping and
> keep my Firefox session running. Would I continue submitting pings for the
> remainder of the session?

With the changes here, only the deletion ping. The problem is that we may have failed to send the deletion ping earlier - I noticed that we fail to send the persisted deletion ping in this case (a mixed-in fix with the changes here, but oh well).
FWIW, this is addressing an issue on the persistence code path, not adding new behavior.

> @@ +925,5 @@
> >     * pings are not sent to the server (Telemetry is a sub-feature of FHR). If trying
> >     * to send a deletion ping, don't block it.
> >     * If unified telemetry is off, don't send pings if Telemetry is disabled.
> >     *
> >     * @param {Object} [ping=null] A ping to be checked.
> 
> hmm, it's a bit odd that Telemetry's canSend depends on the ping itself

Not a new issue, let's take it to a different bug if this needs discussion.

> @@ +933,2 @@
> >      // We only send pings from official builds, but allow overriding this for tests.
> >      if (!Telemetry.isOfficialTelemetry && !this._testMode) {
> 
> hmm, why is this not a startup-time check?
> i'm not against checking twice just in case, but I figured build-time
> variables would be checked at Telemetry initialization time (i.e.
> TelemetrySend would not even be launched)

Ditto.

> @@ +936,5 @@
> >      }
> >  
> >      // With unified Telemetry, the FHR upload setting controls whether we can send pings.
> >      // The Telemetry pref enables sending extended data sets instead.
> >      if (IS_UNIFIED_TELEMETRY) {
> 
> Wouldn't it be better to keep the canSend logic in TelemetryController next
> to the logic for canRecord?
> Seems like treating the deletion ping as just another ping, and then
> special-casing its sending in TelemetrySend.jsm just makes things more
> complicated

Ditto.

> @@ +965,5 @@
> >     *         are resolved.
> >     */
> >    promisePendingPingActivity: function () {
> >      this._log.trace("promisePendingPingActivity - Waiting for ping task");
> >      return Promise.all([for (p of this._pendingPingActivity) p.catch(ex => {
> 
> this method is nasty ;) if a caller submits a ping for sending why should
> the promise remain unfulfilled until all pings in the current buffer are
> sent or fail sending

We need something to hang tests off.

> ::: toolkit/components/telemetry/TelemetryStorage.jsm
> @@ +1036,5 @@
> >      return this._scanPendingPingsTask;
> >    },
> >  
> >    getPendingPingList: function() {
> >      return this._buildPingList();
> 
> the function names are little confusing with getPendingPingList,
> _buildPingList, _scanPendingPings, _checkPendingPings, loadPendingPingList.
> I found it hard to infer based on names how these functions are different
> and how they relate to each other
> 
> I think this it: _checkPendingPings -> loadPendingPingList ->
> _scanPendingPings -> _buildPingList and
> getPendingPingList -> _buildPingList
> 
> We could add documentation for these functions, but I think what we really
> need are better function names.
> 
> can you also add a comment above the functions that return a promise and
> explain what the promise represents?
> 
> And let's try to minimize the amount of promise-using code :)

Not a new issue, follow-up bug?

> @@ +1039,5 @@
> >    getPendingPingList: function() {
> >      return this._buildPingList();
> >    },
> >  
> >    _scanPendingPings: Task.async(function*() {
> 
> Does the saved-telemetry-pings directory still get re-created if it's
> missing?

Yes, see TelemetryStorage.jsm getPingDirectory()
Flags: needinfo?(vdjeric)
I think this bug is about cleaning up existing code, so it seems on-topic to bring up other ways to clean up the same module? I'm not sure if it makes sense to rush clean-up just to be able to uplift it.

There's a lot of stuff to discuss & clarify in the comments above, so let's talk during the workweek. Let me know when you're available (IRC or in-person)
Flags: needinfo?(vdjeric)
The goal here is to implement a new outgoing ping handling scheme, which affects the unified Telemetry shipping plan (with the ping expiry/eviction moved to another bug).
We should discuss other issues with the existing code too as they come up, but I don't think we should block this bug on them.
Outcome of discussing this bug in person in Whistler:
* send limit: simpler 1-batch-per-minute throttling as suggested above, reset by ad-hoc pings
* _sendTaskLogic(): rename to _doSendTask()
* document the functions for retrieving pending pings properly, and use more distinct names (fetch, read, ...?)
* for bug 1168835: do block on purging old pings, but not on enforcing the storage quota
* submitPing(): fix documentation of this and submitExternalPing() so external callers don't rely on the returned promise
* deletion ping: let's keep the canSend fixes; file a new bug on whether we still need it and whether we can do it differently
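For reference, the backoff part of the agreed send behavior can be sketched as a pure delay computation (SEND_TICK_DELAY is the constant quoted from the patch; the maximum backoff value and the helper name are assumptions for illustration - the patch defines its own limit):

```javascript
const MS_IN_A_MINUTE = 60 * 1000;
const SEND_TICK_DELAY = 1 * MS_IN_A_MINUTE;
// Assumed ceiling for illustration; the test in this bug only bounds the
// backoff at 120 minutes rather than stating the exact constant.
const SEND_MAXIMUM_BACKOFF_DELAY_MS = 120 * MS_IN_A_MINUTE;

// Delay before the next send attempt: doubles on failure up to the
// maximum; a success (or an ad-hoc ping submission) resets it to the
// regular tick delay.
function nextSendDelay(currentDelay, sendsFailed) {
  if (!sendsFailed) {
    return SEND_TICK_DELAY;
  }
  return Math.min(SEND_MAXIMUM_BACKOFF_DELAY_MS, currentDelay * 2);
}
```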
Attachment #8626258 - Flags: review?(vdjeric)
Attachment #8624528 - Attachment is obsolete: true
Attachment #8626260 - Flags: review?(vdjeric)
Attachment #8626258 - Attachment is obsolete: true
Attachment #8626258 - Flags: review?(vdjeric)
Attachment #8626261 - Flags: review?(vdjeric)
Attachment #8624529 - Attachment is obsolete: true
Attachment #8626266 - Flags: review?(vdjeric)
Attachment #8626260 - Attachment is obsolete: true
Attachment #8626260 - Flags: review?(vdjeric)
Comment on attachment 8626266 [details] [diff] [review]
Part 2: Re-design the ping sending logic

Review of attachment 8626266 [details] [diff] [review]:
-----------------------------------------------------------------

1. Should scanPendingPings even return a ping list? It normally returns buildPingList(), which is fine, but if there's an error it will always return [] even though _pendingPings could be non-empty (from a throttled or re-tried ad-hoc ping). We could have callers just call buildPingList() or some other method directly instead, and have scanPendingPings just be a fetcher.

*I may have written some incorrect variable/function names above :)

2. Is there a bug filed for centralizing shutdown (& AsyncShutdown blocking) only on TelemetryController?

::: toolkit/components/telemetry/TelemetryController.jsm
@@ +195,5 @@
>    submitExternalPing: function(aType, aPayload, aOptions = {}) {
>      aOptions.addClientId = aOptions.addClientId || false;
>      aOptions.addEnvironment = aOptions.addEnvironment || false;
>  
>      return Impl.submitExternalPing(aType, aPayload, aOptions);

let testOnly = Impl.submitExternalPing(aType, aPayload, aOptions);
return testOnly;

::: toolkit/components/telemetry/TelemetrySend.jsm
@@ +70,5 @@
> +// of pings per minute.
> +const MAX_PING_SENDS_PER_MINUTE = 10;
> +
> +// If we have more pending pings then we can send right now, we schedule the next
> +// send for after SUBMIT_TICK_DELAY.

SUBMIT_ -> SEND_

@@ +74,5 @@
> +// send for after SUBMIT_TICK_DELAY.
> +const SEND_TICK_DELAY = 1 * MS_IN_A_MINUTE;
> +// If we had any ping send failures since the last ping, we use a backoff timeout
> +// for the next ping sends. We increase the delay exponentially up to a limit of
> +// SEND_MAXIMUM_BACKOFF_DELAY_MS.

// An external ping submit will reset this exponential backoff.

@@ +84,4 @@
>  
>  // Files that are older than OVERDUE_PING_FILE_AGE, but younger than
>  // MAX_PING_FILE_AGE indicate that we need to send all of our pings ASAP.
> +const OVERDUE_PING_FILE_AGE = 7 * 24 * 60 * MS_IN_A_MINUTE; // 1 week

You should still report this # to Telemetry

@@ +541,2 @@
>      // Check the pending pings on disk now.
> +    yield this._checkPendingPings();

we don't want to bail here if e.g. we are unable to delete a ping from disk, because then we won't get to send pending pings and it might even end up being a persistent startup failure!

could we add a test for this too? i.e. make sure Telemetry still initializes if it e.g. fails to delete an expired pending ping

also check the rest of the Telemetry initialization paths for this pattern

@@ +588,5 @@
>        }
>      }
>  
>      Services.telemetry.getHistogramById('TELEMETRY_FILES_EVICTED')
>                        .add(evictedCount);

evictedCount never gets incremented, only this._evictedPingsCount

@@ -341,5 @@
>                        .add(evictedCount);
> -
> -    // Check for overdue pings.
> -    const overduePings = infos.filter((info) =>
> -      (now.getTime() - info.lastModificationDate) > OVERDUE_PING_FILE_AGE);

so currently (before this patch), we would delete pings based on MAX_PINGS and only *then* we would count how many pings were older than 2 weeks.

this under-reports how old the pings on disk are because we delete them based on the too-many-pings criterion first.

we should have a follow up bug (with new probes) that first report the status of the ping cache (how many pings found overdue, how many expired, total count) before we delete any pings

::: toolkit/components/telemetry/TelemetryStorage.jsm
@@ +1063,5 @@
> +      let info;
> +      try {
> +        info = yield OS.File.stat(file.path);
> +      } catch (ex) {
> +        this._log.error("_scanPendingPings - failed to state file " + file.path, ex);

state -> stat
Blocks: 1177736
Blocks: 1177762
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #38)
> Comment on attachment 8626266 [details] [diff] [review]
> Part 2: Re-design the ping sending logic
> 
> Review of attachment 8626266 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> 1. Should scanPendingPing list even return a ping list? It normally returns
> buildPingList() which is fine, but if there's an error, it will always
> return [] even though _pendingPings could be non-empty (from throttled or
> re-try ad-hoc ping). We could have callers just called buildPingList() or
> some other method directly instead and have scanPendingPings just be a
> fetcher
> 
> *i may have written some incorrect variable/function names above :)
> 
> 2. Is there a bug filed for centralizing shutdown (& AsyncShutdown blocking)
> only on TelemetryController?

This is mostly done for both init and shutdown, the only thing missing is TelemetrySession.
Fixing that will be part of bug 1156358.

> @@ +588,5 @@
> >        }
> >      }
> >  
> >      Services.telemetry.getHistogramById('TELEMETRY_FILES_EVICTED')
> >                        .add(evictedCount);
> 
> evictedCount never gets incremented, only this._evictedPingsCount
> 
> @@ -341,5 @@
> >                        .add(evictedCount);
> > -
> > -    // Check for overdue pings.
> > -    const overduePings = infos.filter((info) =>
> > -      (now.getTime() - info.lastModificationDate) > OVERDUE_PING_FILE_AGE);
> 
> so currently (before this patch), we would delete pings based on MAX_PINGS
> and only *then* we would count how many pings were older than 2 weeks.
> 
> this under-reports how old the pings on disk are because we delete them
> first based on the too-many-pings criteria first.
> 
> we should have a follow up bug (with new probes) that first report the
> status of the ping cache (how many pings found overdue, how many expired,
> total count) before we delete any pings

Bug 1177762.
(In reply to Georg Fritzsche [:gfritzsche] from comment #39)
> > 1. Should scanPendingPing list even return a ping list? It normally returns
> > buildPingList() which is fine, but if there's an error, it will always
> > return [] even though _pendingPings could be non-empty (from throttled or
> > re-try ad-hoc ping). We could have callers just called buildPingList() or
> > some other method directly instead and have scanPendingPings just be a
> > fetcher

I think it's fine as it is now. This actually only has one caller now (TelemetrySend), which is better off with [] on shutdown.
Callers should wait on the async scanning here; we actually need this for defined behavior.
For the other cases we have "getPendingPingList()" (yeah, not the best naming, I guess).
Attachment #8626829 - Flags: review?(vdjeric)
Attachment #8626266 - Attachment is obsolete: true
Attachment #8626266 - Flags: review?(vdjeric)
Comment on attachment 8626829 [details] [diff] [review]
Part 2: Re-design the ping sending logic

Review of attachment 8626829 [details] [diff] [review]:
-----------------------------------------------------------------

i reviewed the Firefox code, the test code will have to wait for next week

::: toolkit/components/telemetry/TelemetrySend.jsm
@@ +247,5 @@
>      return TelemetrySendImpl.promisePendingPingActivity();
>    },
>  };
>  
> +let CancellableTimeout = {

file a bug to put this in some common JS library later

@@ +267,5 @@
> +
> +    return this._deferred.promise;
> +  },
> +
> +  _onTimeout: function() {

check for this._deferred nullness here

_onTimeout could already be queued on the main thread's event queue while the event that is going to call cancelTimeout is running?

@@ +288,5 @@
> + * SendScheduler implements the timer & scheduling behavior for ping sends.
> + */
> +let SendScheduler = {
> +  // The timer which drives the scheduler.
> +  _schedulerTimer: null,

this is an unused variable?

@@ +294,5 @@
> +  // backoff timeout.
> +  _sendsFailed: false,
> +  // The current retry delay after ping send failures. We use this for the exponential backoff,
> +  // increasing this value everytime we had send failures since the last tick.
> +  _currentTickTimeout: SEND_TICK_DELAY,

_currentTickTimeout -> _backoffDelay

@@ +330,5 @@
> +  /**
> +   * Notify the scheduler of a failure in sending out pings that warrants retrying.
> +   * This will trigger the exponential backoff timer behavior on the next tick.
> +   */
> +  sendFailed: function() {

notifySendsFailed()?

@@ +383,5 @@
> +      // Get a list of pending pings, sorted by last modified, descending.
> +      // Filter out all the pings we can't send now. This addresses scenarios like "deletion" pings
> +      // which can be send even when upload is disabled.
> +      let pending = TelemetryStorage.getPendingPingList();
> +      let current = TelemetrySendImpl.currentPendingPings();

a humble naming suggestion :)

getPendingPingList -> getPersistedPings()
currentPendingPings -> getUnpersistedPings()

especially given that you have pendingPingCount()

@@ +385,5 @@
> +      // which can be send even when upload is disabled.
> +      let pending = TelemetryStorage.getPendingPingList();
> +      let current = TelemetrySendImpl.currentPendingPings();
> +      this._log.trace("_doSendTask - pending: " + pending.length + ", current: " + current.length);
> +      pending = pending.filter(p => TelemetrySendImpl.canSend(p));

what if there are multiple kill pings on disk, do we want to send them? 

If so, do we also want to notify servers with a Telemetry ping when a user has re-enabled Telemetry reporting?

maybe a question for bcolloran? not sure what data he needs to measure & track opt-out rates

@@ +425,5 @@
> +        return;
> +      }
> +
> +      const sendTime = Policy.now() - sendStartTime;
> +      let timeout = Math.max(0, SEND_TICK_DELAY - sendTime);

- add a comment that this is the normal path, i.e. if ping sending succeeds, then you send the next batch up to 1 minute after you started sending the last batch
- and maybe rename timeout to nextSendDelay?

@@ +432,5 @@
> +        this._log.trace("_doSendTask - had no send failures, resetting backoff timer");
> +        resetBackoffTimer();
> +      } else {
> +        const newTimeout = Math.min(SEND_MAXIMUM_BACKOFF_DELAY_MS,
> +                                this._currentTickTimeout * 2);

indentation?

@@ +602,3 @@
>     }),
>  
>    shutdown: Task.async(function*() {

make sure you monitor Yoric's shutdown-hangs dashboard for reports of Telemetry shutdown hangs

@@ +732,5 @@
>      }
>  
> +    if (pingIds.length < 1) {
> +      this._log.trace("sendPersistedPings - no pings to send");
> +      return;

why not Promise.resolve() here

@@ +744,4 @@
>        pingSendPromises.push(
> +        TelemetryStorage.loadPendingPing(id)
> +          .then((data) => this._doPing(data, id, true))
> +          .catch(e => this._log.error("sendPersistedPings - failed to send ping " + id, e)));

add a comment about where send failures get handled

@@ +806,5 @@
>      let slug = pathComponents.join("/");
>      return "/submit/telemetry/" + slug;
>    },
>  
>    _doPing: function(ping, id, isPersisted) {

we'll have to clean this function up one day :(
Attachment #8626829 - Flags: review?(vdjeric)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #42)
> ::: toolkit/components/telemetry/TelemetrySend.jsm
> @@ +247,5 @@
> >      return TelemetrySendImpl.promisePendingPingActivity();
> >    },
> >  };
> >  
> > +let CancellableTimeout = {
> 
> file a bug to put this in some common JS library later

Bug 1178205.

> @@ +383,5 @@
> > +      // Get a list of pending pings, sorted by last modified, descending.
> > +      // Filter out all the pings we can't send now. This addresses scenarios like "deletion" pings
> > +      // which can be send even when upload is disabled.
> > +      let pending = TelemetryStorage.getPendingPingList();
> > +      let current = TelemetrySendImpl.currentPendingPings();
> 
> a humble naming suggestion :)
> 
> getPendingPingList -> getPersistedPings()
> currentPendingPings -> getUnpersistedPings()

I renamed TelemetrySendImpl.currentPendingPings() to TelemetrySendImpl.getUnpersistedPings().
TelemetryStorage.getPendingPingList() seems fine, as this is a function on our storage module - the function there should definitely mention pending pings, as that module also handles archived ping storage etc.
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #42)
> @@ +385,5 @@
> > +      // which can be send even when upload is disabled.
> > +      let pending = TelemetryStorage.getPendingPingList();
> > +      let current = TelemetrySendImpl.currentPendingPings();
> > +      this._log.trace("_doSendTask - pending: " + pending.length + ", current: " + current.length);
> > +      pending = pending.filter(p => TelemetrySendImpl.canSend(p));
> 
> what if there are multiple kill pings on disk, do we want to send them? 
> 
> If so, do we also want to notify servers with a Telemetry ping when a user
> has re-enabled Telemetry reporting?
> 
> maybe a question for bcolloran? not sure what data he needs to measure &
> track opt-out rates

Brendan & Mark, this is kind of a corner case - if a user opts out of FHR submissions, we will send a "deletion" ping (which requests deleting or unlinking server-side data).
If a user opts out of FHR submissions without a network connection, later opts in again and then opts out again - do we need each of the individual deletion pings?
Is there any downside to sending each rather than only the last?
Flags: needinfo?(mreid)
Flags: needinfo?(bcolloran)
Comment on attachment 8626261 [details] [diff] [review]
Part 3 - Extend the Telemetry send test coverage

Moving review - Vladan is a bit busy and Yoric is on PTO at least today.
Please check this against the send logic requirements and any potential failure scenarios.
Does this provide adequate coverage?
And does this seem like a sane approach for the tests?
Attachment #8626261 - Flags: review?(vdjeric) → review?(alessio.placitelli)
Attachment #8627123 - Flags: review?(vdjeric)
Attachment #8626829 - Attachment is obsolete: true
Comment on attachment 8626261 [details] [diff] [review]
Part 3 - Extend the Telemetry send test coverage

Review of attachment 8626261 [details] [diff] [review]:
-----------------------------------------------------------------

In part 2 the send task "Filters out all the pings we can't send now. This addresses scenarios like "deletion" pings". Does this require a separate test or is it tested elsewhere?

::: toolkit/components/telemetry/TelemetryStorage.jsm
@@ -55,5 @@
>  
>  const UUID_REGEX = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
>  
> -let isPingDirectoryCreated = false;
> -

Why are we removing this?

@@ +1073,5 @@
>        }
>  
>        this._pendingPings.set(id, {
>          path: file.path,
> +        lastModificationDate: info.lastModificationDate.getTime(),

This belongs to part 2, since you're modifying the file there.

@@ +1246,5 @@
>  function getPingDirectory() {
>    return Task.spawn(function*() {
>      let directory = TelemetryStorage.pingDirectoryPath;
>  
> +    if (!(yield OS.File.exists(directory))) {

Why are you restoring this?

@@ -1253,2 @@
>        yield OS.File.makeDir(directory, { unixMode: OS.Constants.S_IRWXU });
> -      isPingDirectoryCreated = true;

Same here.

::: toolkit/components/telemetry/tests/unit/head.js
@@ +263,5 @@
> +  let module = Cu.import("resource://gre/modules/TelemetryController.jsm");
> +  module.Policy.generatePingId = func;
> +}
> +
> +

Extra newline. Please remove one.

::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
@@ +19,5 @@
> +
> +const MS_IN_A_MINUTE = 60 * 1000;
> +
> +function run_test() {
> +  do_get_profile(true);

Any reason for this argument to be true? Do we need "profile-after-change" to be notified here?

@@ +37,5 @@
> +  const path = OS.Path.join(TelemetryStorage.pingDirectoryPath, id);
> +  return OS.File.setDates(path, null, timestamp);
> +}
> +
> +add_task(function* test_sendPendingPings() {

This test is ~140 lines long. I think it could be a good idea to add a *brief* comment here to outline its purpose. This would make it easier to follow the test code, giving a bit of background.

@@ +100,5 @@
> +  yield TelemetryController.reset();
> +  let [pingSendTimerCallback, pingSendTimeout] = yield timerPromise;
> +  Assert.ok(!!pingSendTimerCallback, "Should have a timer callback");
> +
> +  // We should have received 10 pings from the first send batch:

Could we please state that we expect newest pings to be received first?

@@ +147,5 @@
> +  countByType = countPingTypes(pings);
> +
> +  Assert.equal(countByType.get(TEST_TYPE_A), 5, "Should have received the correct amount of type A pings");
> +
> +  // Failing a ping send now should trigger backoff behavior.

Is there any particular reason why the backoff testing isn't a separate test? It would help comprehension.

@@ +157,5 @@
> +  yield TelemetryController.submitExternalPing(TEST_TYPE_C, {});
> +  [pingSendTimerCallback, pingSendTimeout] = yield timerPromise;
> +  Assert.equal(TelemetrySend.pendingPingCount, 1, "Should have one pending ping.");
> +
> +  const MAX_TIMEOUT = 120 * MS_IN_A_MINUTE;

Mh, I'd suggest commenting about the 120 or renaming MAX_TIMEOUT to something referencing the backoff.
Attachment #8626261 - Flags: review?(alessio.placitelli) → feedback+
(In reply to Alessio Placitelli [:Dexter] from comment #47)
> In part 2 the send task "Filters out all the pings we can't send now. This
> addresses scenarios like "deletion" pings". Does this require a separate
> test or is it tested elsewhere?

That happens to fix an existing bug and is self-contained - it can go in a separate patch or bug.

> ::: toolkit/components/telemetry/TelemetryStorage.jsm
> @@ -55,5 @@
> >  
> >  const UUID_REGEX = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
> >  
> > -let isPingDirectoryCreated = false;
> > -
> 
> Why are we removing this?

Probably because I uncovered a bug with stale state, which was fixed by always checking whether the directory exists.

> @@ +1073,5 @@
> >        }
> >  
> >        this._pendingPings.set(id, {
> >          path: file.path,
> > +        lastModificationDate: info.lastModificationDate.getTime(),
> 
> This belongs to part 2, since you're modifying the file there.

Part 2 runs fine on its own; this new test and the changes here are self-contained. We don't need to keep the changes to one file all in one patch.

> @@ +1246,5 @@
> >  function getPingDirectory() {
> >    return Task.spawn(function*() {
> >      let directory = TelemetryStorage.pingDirectoryPath;
> >  
> > +    if (!(yield OS.File.exists(directory))) {
> 
> Why are you restoring this?

What do you mean with restoring?

> ::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
> @@ +19,5 @@
> > +
> > +const MS_IN_A_MINUTE = 60 * 1000;
> > +
> > +function run_test() {
> > +  do_get_profile(true);
> 
> Any reason for this argument to be true? Do we need "profile-after-change"
> to be notified here?

See the normal startup path in TelemetryStartup.js.
(In reply to Georg Fritzsche [:gfritzsche] from comment #48)
> (In reply to Alessio Placitelli [:Dexter] from comment #47)
> > In part 2 the send task "Filters out all the pings we can't send now. This
> > addresses scenarios like "deletion" pings". Does this require a separate
> > test or is it tested elsewhere?
> 
> That happens to fix an existing bug and is self-contained - it can go in a
> separate patch or bug.

Good then!

> > ::: toolkit/components/telemetry/TelemetryStorage.jsm
> > @@ -55,5 @@
> > >  
> > >  const UUID_REGEX = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
> > >  
> > > -let isPingDirectoryCreated = false;
> > > -
> > 
> > Why are we removing this?
> 
> Probably because i uncovered a bug with stale state which was fixed by
> always checking whether the directory exists.

Could you expand a bit on why this change is needed (along with the others in TelemetryStorage.jsm)? What's the bug you probably uncovered?

> > @@ +1073,5 @@
> > >        }
> > >  
> > >        this._pendingPings.set(id, {
> > >          path: file.path,
> > > +        lastModificationDate: info.lastModificationDate.getTime(),
> > 
> > This belongs to part 2, since you're modifying the file there.
> 
> Part 2 runs fine on its own, this new test and the changes here are
> self-contained. We don't need to keep the changes to one file all in one
> patch.

Indeed we don't, but shouldn't part 2 be broken because of this? Anyway, it's fine for me to keep it in this patch.

> > @@ +1246,5 @@
> > >  function getPingDirectory() {
> > >    return Task.spawn(function*() {
> > >      let directory = TelemetryStorage.pingDirectoryPath;
> > >  
> > > +    if (!(yield OS.File.exists(directory))) {
> > 
> > Why are you restoring this?
> 
> What do you mean with restoring?

I mean, as above, could you expand a bit on why is this needed?

> > ::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
> > @@ +19,5 @@
> > > +
> > > +const MS_IN_A_MINUTE = 60 * 1000;
> > > +
> > > +function run_test() {
> > > +  do_get_profile(true);
> > 
> > Any reason for this argument to be true? Do we need "profile-after-change"
> > to be notified here?
> 
> See the normal startup path in TelemetryStartup.js.

Since you're using |TelemetryController.reset()|, that is taking care of the initial TelemetryController setup. It could be helpful to clarify that with a comment near |do_get_profile|.
(In reply to Alessio Placitelli [:Dexter] from comment #49)
> > > ::: toolkit/components/telemetry/TelemetryStorage.jsm
> > > @@ -55,5 @@
> > > >  
> > > >  const UUID_REGEX = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
> > > >  
> > > > -let isPingDirectoryCreated = false;
> > > > -
> > > 
> > > Why are we removing this?
> > 
> > Probably because i uncovered a bug with stale state which was fixed by
> > always checking whether the directory exists.
> 
> Could you expand a bit on why this change is needed (along with the others
> in TelemetryStorage.jsm)? What's the bug you probably uncovered?

As I said, stale state - that flag never gets reset and might not match the on-disk state.

> > > @@ +1073,5 @@
> > > >        }
> > > >  
> > > >        this._pendingPings.set(id, {
> > > >          path: file.path,
> > > > +        lastModificationDate: info.lastModificationDate.getTime(),
> > > 
> > > This belongs to part 2, since you're modifying the file there.
> > 
> > Part 2 runs fine on its own, this new test and the changes here are
> > self-contained. We don't need to keep the changes to one file all in one
> > patch.
> 
> Indeed we don't, but shouldn't part 2 be broken because of this? Anyway,
> it's fine for me to keep it in this patch.

Part 2 is not broken test-wise. I think that's just a subjective style question on where this goes.

> > > @@ +1246,5 @@
> > > >  function getPingDirectory() {
> > > >    return Task.spawn(function*() {
> > > >      let directory = TelemetryStorage.pingDirectoryPath;
> > > >  
> > > > +    if (!(yield OS.File.exists(directory))) {
> > > 
> > > Why are you restoring this?
> > 
> > What do you mean with restoring?
> 
> I mean, as above, could you expand a bit on why is this needed?

This is not restoring anything AFAICT.

> > > ::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
> > > @@ +19,5 @@
> > > > +
> > > > +const MS_IN_A_MINUTE = 60 * 1000;
> > > > +
> > > > +function run_test() {
> > > > +  do_get_profile(true);
> > > 
> > > Any reason for this argument to be true? Do we need "profile-after-change"
> > > to be notified here?
> > 
> > See the normal startup path in TelemetryStartup.js.
> 
> Since you're using |TelemetryController.reset()|, that is taking care of the
> initial TelemetryController setup. Could be helpful to clarify that with a
> comment line near |do_get_profile|.

We do this in other tests too; it's the proper startup sequence for Telemetry.
TelemetryController.reset() happens later in the test after pings were already submitted.
Attachment #8627211 - Flags: review?(alessio.placitelli)
Attachment #8626261 - Attachment is obsolete: true
Comment on attachment 8627211 [details] [diff] [review]
Part 3 - Extend the Telemetry send test coverage

Review of attachment 8627211 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!
Attachment #8627211 - Flags: review?(alessio.placitelli) → review+
Comment on attachment 8627123 [details] [diff] [review]
Part 2: Re-design the ping sending logic

Review of attachment 8627123 [details] [diff] [review]:
-----------------------------------------------------------------

Conditions which need to be tested:

- When there are > 10 pings on disk, all pings get sent eventually, in batches of up to 10 pings
- Expired pings don't get sent
- Expired/overdue/total ping counts should be accurate
- Midnight fuzzing
- Bad pings (access denied or corrupted json) do not prevent sending of other pings
- Exponential backoff behavior on send-fail, including ad-hoc pings resetting the backoff timer
- Correct behaviour at shutdown wrt persisting unpersisted pings and aborting sends
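
For reference, the exponential backoff behaviour listed above (double the retry delay on send failure, reset it on success or on an ad-hoc ping submission) can be sketched roughly like this. The constants and the `SendScheduler` class name are hypothetical, not the actual TelemetrySend code:

```javascript
// Illustrative sketch of exponential send backoff; constants and names
// are hypothetical, not the real TelemetrySend values.
const SEND_TICK_DELAY = 60 * 1000;                   // initial retry delay
const SEND_MAXIMUM_BACKOFF_DELAY = 120 * 60 * 1000;  // cap on the delay

class SendScheduler {
  constructor() {
    this.backoffDelay = SEND_TICK_DELAY;
  }

  // After each send attempt: reset the delay on success, double it on
  // failure up to the maximum. Returns the delay until the next attempt.
  onSendResult(success) {
    this.backoffDelay = success
      ? SEND_TICK_DELAY
      : Math.min(this.backoffDelay * 2, SEND_MAXIMUM_BACKOFF_DELAY);
    return this.backoffDelay;
  }

  // An ad-hoc ping submission resets the backoff so the new ping is
  // tried promptly instead of waiting out a long failure delay.
  onNewPingSubmitted() {
    this.backoffDelay = SEND_TICK_DELAY;
  }
}
```

A test along these lines would assert that repeated failures double the delay, that the delay never exceeds the cap, and that a new ping submission resets it.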

::: toolkit/components/telemetry/tests/unit/head.js
@@ +3,5 @@
>  
>  const { classes: Cc, utils: Cu, interfaces: Ci, results: Cr } = Components;
>  
>  Cu.import("resource://gre/modules/TelemetryController.jsm", this);
> +Cu.import("resource://gre/modules/Services.jsm");

why did you drop the "this"

@@ +248,5 @@
>  }
>  
>  // Fake the timeout functions for TelemetryController sending.
>  function fakePingSendTimer(set, clear) {
> +  Services.telemetry;

what does this do
Attachment #8627123 - Flags: review?(vdjeric)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #53)
> Comment on attachment 8627123 [details] [diff] [review]
> Part 2: Re-design the ping sending logic
> 
> Review of attachment 8627123 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Conditions which need to be tested:
> 
> - When there are > 10 pings on disk, all pings get sent eventually, in
> batches of up to 10 pings
> - Expired pings don't get sent
> - Expired/overdue/total ping counts should be accurate
> - Midnight fuzzing
> - Bad pings (access denied or corrupted json) do not prevent sending of
> other pings
> - Exponential backoff behavior on send-fail, including ad-hoc pings
> resetting the backoff timer
> - Correct behaviour at shutdown wrt persisting unpersisted pings and
> aborting sends

The new test-coverage goes in part 3 - does this mean part 2 is done or not?

> ::: toolkit/components/telemetry/tests/unit/head.js
> @@ +3,5 @@
> >  
> >  const { classes: Cc, utils: Cu, interfaces: Ci, results: Cr } = Components;
> >  
> >  Cu.import("resource://gre/modules/TelemetryController.jsm", this);
> > +Cu.import("resource://gre/modules/Services.jsm");
> 
> why did you drop the "this"
> 
> @@ +248,5 @@
> >  }
> >  
> >  // Fake the timeout functions for TelemetryController sending.
> >  function fakePingSendTimer(set, clear) {
> > +  Services.telemetry;
> 
> what does this do

Ah, debugging left-overs.
Flags: needinfo?(vdjeric)
Attachment #8627123 - Flags: review?(vdjeric)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #53)
> - When there are > 10 pings on disk, all pings get sent eventually, in
> batches of up to 10 pings

In part 3, test_TelemetrySend.js

> - Expired pings don't get sent

In test_TelemetrySendOldPings.js

> - Expired/overdue/total ping counts should be accurate

* TelemetrySend.overduePingsCount and .discardedPingsCount (expired) are checked in test_TelemetrySendOldPings.js
* TelemetrySend.pendingPingCount (total after cleaning out) is checked in test_TelemetrySend.js

> - Midnight fuzzing

Already have existing coverage for this.

> - Bad pings (access denied or corrupted json) do not prevent sending of
> other pings

We have test coverage for corrupt pings not breaking other behavior; access denied is harder to do cross-platform.

> - Exponential backoff behavior on send-fail, including ad-hoc pings
> resetting the backoff timer

In part 3, test_TelemetrySend.js

> - Correct behaviour at shutdown wrt persisting unpersisted pings and
> aborting sends

Presumably covered elsewhere, but I'll add a minimal explicit one in test_TelemetrySend.js
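
As an aside, the midnight fuzzing covered above can be sketched as follows. This is an illustrative approximation: the one-hour fuzz window and the function names are hypothetical, not the actual TelemetrySend implementation:

```javascript
// Illustrative sketch of midnight fuzzing: sends scheduled close to local
// midnight get delayed by a random offset to avoid a server load spike.
// The one-hour window and function names are hypothetical.
const MIDNIGHT_FUZZING_INTERVAL_MS = 60 * 60 * 1000;

function isNearMidnight(date, windowMs = MIDNIGHT_FUZZING_INTERVAL_MS) {
  const msIntoDay = ((date.getHours() * 60 + date.getMinutes()) * 60 +
                     date.getSeconds()) * 1000 + date.getMilliseconds();
  const msUntilMidnight = 24 * 60 * 60 * 1000 - msIntoDay;
  return msIntoDay < windowMs || msUntilMidnight < windowMs;
}

// Returns the extra delay (in ms) to apply before sending at |date|.
function fuzzedSendDelay(date, rand = Math.random()) {
  return isNearMidnight(date)
    ? Math.floor(rand * MIDNIGHT_FUZZING_INTERVAL_MS)
    : 0;
}
```

Tests for this kind of logic typically fake the clock (as the fakeNow/futureDate helpers do) and check that sends near midnight are delayed while daytime sends are not.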
Attachment #8627123 - Flags: review?(vdjeric) → review+
Flags: needinfo?(vdjeric)
Attachment #8627123 - Attachment is obsolete: true
This adds an explicit test for persisting the "current" pings. In adding this I ended up triggering several shortcomings of our testing; those fixes are in this version too.
Attachment #8628354 - Flags: feedback?(alessio.placitelli)
Attachment #8627211 - Attachment is obsolete: true
Attachment #8628351 - Flags: review+
Attachment #8628354 - Flags: review+
Attachment #8628365 - Flags: feedback?(alessio.placitelli)
Attachment #8628354 - Attachment is obsolete: true
Attachment #8628354 - Flags: feedback?(alessio.placitelli)
Comment on attachment 8628365 [details] [diff] [review]
Part 3 - Extend the Telemetry send test coverage

Review of attachment 8628365 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good with the things below addressed. Sorry, they slipped through on the last pass.

::: toolkit/components/telemetry/tests/unit/head.js
@@ +267,5 @@
>  
> +function fakeGeneratePingId(func) {
> +  let module = Cu.import("resource://gre/modules/TelemetryController.jsm");
> +  module.Policy.generatePingId = func;
> +}

It looks like this function is being added twice: see line 263.

::: toolkit/components/telemetry/tests/unit/test_PingAPI.js
@@ +374,5 @@
>    let id = yield TelemetryController.submitExternalPing("test-type", {}, {addClientId: true});
>    let ping = yield TelemetryArchive.promiseArchivedPingById(id);
>  
>    Assert.ok(!!ping, "Should have loaded the ping.");
> +  Assert.ok("clientId" in ping, "Ping should have a client id.");

Good catch.

::: toolkit/components/telemetry/tests/unit/test_TelemetrySend.js
@@ +147,5 @@
> +  // Trigger the next tick - we should receive the next 10 type A pings.
> +  PingServer.resetPingHandler();
> +  now = fakeNow(futureDate(now, pingSendTimeout));
> +  timerPromise = waitForTimer();
> +  pingSendTimerCallback();

Shouldn't we be yielding here? I don't think that's really required though.
Attachment #8628365 - Flags: feedback?(alessio.placitelli) → feedback+
(In reply to Georg Fritzsche [:gfritzsche] from comment #44)
> Brendan & Mark, this is kind of a corner case - if a user opts out of FHR
> submissions, we will send a "deletion" ping (which request deleting or
> unlinking server-side data).
> If a user opts out of FHR submissions without having network connections,
> later opts in again and then opts out again - do we need each of the
> individual deletion pings?
> Is there any downside to sending each or only sending the last?
For the purposes of implementing the deletion ping on the server, we only need to see one of them (as long as clientId does not change). It should also be fine to send them all if that simplifies things.
Flags: needinfo?(mreid)
(In reply to Mark Reid [:mreid] from comment #61)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #44)
> > Brendan & Mark, this is kind of a corner case - if a user opts out of FHR
> > submissions, we will send a "deletion" ping (which request deleting or
> > unlinking server-side data).
> > If a user opts out of FHR submissions without having network connections,
> > later opts in again and then opts out again - do we need each of the
> > individual deletion pings?
> > Is there any downside to sending each or only sending the last?
> For the purposes of implementing the deletion ping on the server, we only
> need to see one of them (as long as clientId does not change). It should
> also be fine to send them all if that simplifies things.

Will each deletion ping be sent with a clientId? If so, will the clientId be changed after the first opt-out?

My assumption is that this edge case should happen almost never in actual practice, so we shouldn't worry too much about it. I think as long as we receive one deletion ping we will be happy. But it would be interesting to see if we do ever get multiple deletion pings from a client.
Flags: needinfo?(bcolloran)
(In reply to brendan c from comment #62)
> (In reply to Mark Reid [:mreid] from comment #61)
> > (In reply to Georg Fritzsche [:gfritzsche] from comment #44)
> > > Brendan & Mark, this is kind of a corner case - if a user opts out of FHR
> > > submissions, we will send a "deletion" ping (which request deleting or
> > > unlinking server-side data).
> > > If a user opts out of FHR submissions without having network connections,
> > > later opts in again and then opts out again - do we need each of the
> > > individual deletion pings?
> > > Is there any downside to sending each or only sending the last?
> > For the purposes of implementing the deletion ping on the server, we only
> > need to see one of them (as long as clientId does not change). It should
> > also be fine to send them all if that simplifies things.
> 
> Will each deletion ping be sent with a clientId? If so, will the clientId be
> changed after the first opt-out?

No, it won't change. This was discussed in bug 1178262, we also will only send one deletion ping per that bug.
Blocks: 1181553
Depends on: 1184082
Comment on attachment 8624527 [details] [diff] [review]
Part 1: Clean up test utils for waiting on ping submissions

Approval Request Comment
[Feature/regressing bug #]: Unified Telemetry
[User impact if declined]: The patches here implement new, improved logic for sending pings. This is a shipping requirement.
[Describe test coverage new/current, TreeHerder]: We have automated test coverage and have monitored for impacts from the change in server-side dashboards.
[Risks and why]: Low risk: per server-side verification, Telemetry is very unlikely to be affected. There is a minor risk of async shutdown timeouts, which is hard to mitigate, but landing this earlier on Aurora will help us catch remaining issues sooner.
[String/UUID change made/needed]: None.
Attachment #8624527 - Flags: approval-mozilla-aurora?
Comment on attachment 8628351 [details] [diff] [review]
Part 2: Re-design the ping sending logic

Approval Request Comment
[Feature/regressing bug #]: Unified Telemetry
[User impact if declined]: The patches here implement new, improved logic for sending pings. This is a shipping requirement.
[Describe test coverage new/current, TreeHerder]: We have automated test coverage and have monitored for impacts from the change in server-side dashboards.
[Risks and why]: Low risk: per server-side verification, Telemetry is very unlikely to be affected. There is a minor risk of async shutdown timeouts, which is hard to mitigate, but landing this earlier on Aurora will help us catch remaining issues sooner.
[String/UUID change made/needed]: None.
Attachment #8628351 - Flags: approval-mozilla-aurora?
Comment on attachment 8628365 [details] [diff] [review]
Part 3 - Extend the Telemetry send test coverage

Approval Request Comment
[Feature/regressing bug #]: Unified Telemetry
[User impact if declined]: The patches here implement new, improved logic for sending pings. This is a shipping requirement.
[Describe test coverage new/current, TreeHerder]: We have automated test coverage and have monitored for impacts from the change in server-side dashboards.
[Risks and why]: Low risk: per server-side verification, Telemetry is very unlikely to be affected. There is a minor risk of async shutdown timeouts, which is hard to mitigate, but landing this earlier on Aurora will help us catch remaining issues sooner.
[String/UUID change made/needed]: None.
Attachment #8628365 - Flags: approval-mozilla-aurora?
Attachment #8628365 - Flags: review+
Comment on attachment 8622564 [details] [diff] [review]
Part 4: Update the ping submission documentation

Approval Request Comment
[Feature/regressing bug #]: Unified Telemetry
[User impact if declined]: The patches here implement new, improved logic for sending pings. This is a shipping requirement.
[Describe test coverage new/current, TreeHerder]: We have automated test coverage and have monitored for impacts from the change in server-side dashboards.
[Risks and why]: Low risk: per server-side verification, Telemetry is very unlikely to be affected. There is a minor risk of async shutdown timeouts, which is hard to mitigate, but landing this earlier on Aurora will help us catch remaining issues sooner.
[String/UUID change made/needed]: None.
Attachment #8622564 - Flags: approval-mozilla-aurora?
Ping on this bug?
We need this soon for unified Telemetry on 41.
Flags: needinfo?(rkothari)
Comment on attachment 8622564 [details] [diff] [review]
Part 4: Update the ping submission documentation

Given that this feature has automated test coverage and has been in m-c for a few weeks, it seems safe to land it in Aurora now.
Flags: needinfo?(rkothari)
Attachment #8622564 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8624527 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8628351 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8628365 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Requesting QE team to include this bug and related "Unified Telemetry" fixes into their sign-offs for FF41.
Flags: qe-verify+
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #38)
> @@ +588,5 @@
> >        }
> >      }
> >  
> >      Services.telemetry.getHistogramById('TELEMETRY_FILES_EVICTED')
> >                        .add(evictedCount);
> 
> evictedCount never gets incremented, only this._evictedPingsCount


> @@ -341,5 @@
> >                        .add(evictedCount);
> > -
> > -    // Check for overdue pings.
> > -    const overduePings = infos.filter((info) =>
> > -      (now.getTime() - info.lastModificationDate) > OVERDUE_PING_FILE_AGE);
> 
> so currently (before this patch), we would delete pings based on MAX_PINGS
> and only *then* we would count how many pings were older than 2 weeks.
> 
> this under-reports how old the pings on disk are because we delete them
> first based on the too-many-pings criteria first.
> 
> we should have a follow up bug (with new probes) that first report the
> status of the ping cache (how many pings found overdue, how many expired,
> total count) before we delete any pings
> 

Georg: these are the ping eviction bugs I was referring to in today's FHR v4 meeting. Are these fixed now?
Flags: needinfo?(gfritzsche)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #74)
> (In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #38)
> > @@ +588,5 @@
> > >        }
> > >      }
> > >  
> > >      Services.telemetry.getHistogramById('TELEMETRY_FILES_EVICTED')
> > >                        .add(evictedCount);
> > 
> > evictedCount never gets incremented, only this._evictedPingsCount

Ah, I completely unloaded things since then; that got fixed in this bug.
Bug 1168835 has since dropped this in favor of other measures.

> > @@ -341,5 @@
> > >                        .add(evictedCount);
> > > -
> > > -    // Check for overdue pings.
> > > -    const overduePings = infos.filter((info) =>
> > > -      (now.getTime() - info.lastModificationDate) > OVERDUE_PING_FILE_AGE);
> > 
> > so currently (before this patch), we would delete pings based on MAX_PINGS
> > and only *then* we would count how many pings were older than 2 weeks.
> > 
> > this under-reports how old the pings on disk are because we delete them
> > first based on the too-many-pings criteria first.
> > 
> > we should have a follow up bug (with new probes) that first report the
> > status of the ping cache (how many pings found overdue, how many expired,
> > total count) before we delete any pings
> > 

Bug 1168835 removes the MAX_LRU expiry: we now just check the ping age and only enforce the quota afterwards.
Furthermore, we started submitting a "ping age in days" measure.
Hm, given the brokenness of the "overdue" measurement, we could maybe just remove it in favor of the new "ping age" histogram.
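
For illustration, the age-first order described here (check each ping's age and report it before any quota-based deletion, so age measurements aren't skewed) could look like this sketch. The `partitionPings` helper and the two-week constant are hypothetical names, not the actual TelemetryStorage code:

```javascript
// Illustrative sketch: classify pending pings by age before enforcing any
// disk quota. Names and the two-week threshold are hypothetical.
const OVERDUE_PING_FILE_AGE_MS = 14 * 24 * 60 * 60 * 1000; // two weeks

// |pingInfos| are objects with a lastModificationDate timestamp (ms).
function partitionPings(pingInfos, nowMs) {
  const expired = [];
  const pending = [];
  for (const info of pingInfos) {
    const ageMs = nowMs - info.lastModificationDate;
    if (ageMs > OVERDUE_PING_FILE_AGE_MS) {
      expired.push(info);   // report the age, then discard
    } else {
      pending.push(info);   // quota enforcement happens afterwards
    }
  }
  return { expired, pending };
}
```

Doing the classification in one pass before any deletion is what keeps the reported counts and ages accurate.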
Flags: needinfo?(gfritzsche)
Blocks: 1173941
Flags: qe-verify+
You need to log in before you can comment on or make changes to this bug.