Closed Bug 1133536 Opened 5 years ago Closed 5 years ago

Detect & report aborted sessions in Telemetry

Categories

(Toolkit :: Telemetry, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
firefox39 --- fixed

People

(Reporter: gfritzsche, Assigned: Dexter)

References

Details

Attachments

(4 files, 22 obsolete files)

1.83 KB, patch
Dexter
: review+
Details | Diff | Splinter Review
25.22 KB, patch
gfritzsche
: review+
Details | Diff | Splinter Review
43.49 KB, patch
Dexter
: review+
Details | Diff | Splinter Review
2.65 KB, patch
Dexter
: review+
Details | Diff | Splinter Review
Looks like this got lost and no bug got filed...

We have an important & blocking data quality issue from aborted sessions.
Currently with Telemetry we just lose the data (with FHR we could recover due to the backing SQLite DB).

We need to regularly (10-60min?) persist a backup of the current subsession to disk (overwriting the last backup).
We also need to save it early after startup (to cover short aborted sessions), probably just in the TelemetrySession deferred initialization we have for now.
On proper shutdown we remove that backup.

If we find such a backup ping on startup, this means we had an aborted session and should send it ("main" ping type, "aborted-session" reason).
Maybe it's enough to store it in the pending ping folder so it gets picked up and sent on startup.
* Fire a timer regularly
* Collect subsession values without resetting data
* Save through TelemetryPing without waiting for it
* If we successfully shutdown, delete it
  * This probably means we need to do saves the saves and deletion through another serial task queue
* On startup, check if we have one of those saved pings
* If yes, just send it normally

This would save a "main" ping with reason "aborted-session".
The only bigger change is probably getting TelemetryPing to save the ping in an explicit location so it is not picking it up when checking pending pings etc.
Lets maybe start with a timer interval of 30 minutes - Benjamin, Vladan, is that acceptable for now?
Flags: needinfo?(vdjeric)
Flags: needinfo?(benjamin)
Unlikely that i could finish this today, so moving this to Alessio and syncing on this now.
Assignee: nobody → alessio.placitelli
Also: Whenever saving/sending subsession pings, we should reschedule the timer so it doesn't fire immediately after this.
What are the costs of doing this? Does it cause jank or other badness? I would typically be asking for something much more frequent: 1-5 minutes, but only of active usage (e.g. don't wake up from idle/sleep just to save a ping). If we can't do that, then let's balance out the power cost versus the frequency.
Flags: needinfo?(benjamin)
This patch introduces the aborted-session ping payload and regularly saves it to disk.
Attached patch part 2 - Add test coverage. (obsolete) — Splinter Review
This patch adds test coverage for the aborted-session payload.
Comment on attachment 8570541 [details] [diff] [review]
part 1 - Write aborted-session pings to disk

I forgot to mention that this patch also removes the |TelemetryPing.testSavePingToFile| test function.
Attachment #8570541 - Flags: review?(vdjeric)
Attachment #8570543 - Flags: review?(vdjeric)
Attached patch part 3 - Update the docs. (obsolete) — Splinter Review
Whoops, I forgot to mention this ping type in the docs.
Attachment #8572727 - Flags: review?(vdjeric)
(In reply to Georg Fritzsche [:gfritzsche] [away Feb 27 - March 15] from comment #2)
> Lets maybe start with a timer interval of 30 minutes - Benjamin, Vladan, is
> that acceptable for now?

30 minutes seems ok, but I'm not 100% clear on what we want to use these for. Benjamin, what are these crashed session pings going to be used for? Finding causes of crashes, correlating configurations system info & configuration with crash rates, de-biasing submitted data, or something else?

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #5)
> What are the costs of doing this? Does it cause jank or other badness?

Collecting Telemetry data (e.g. gatherMemory) can cause main-thread jank. Before unification, we only collected Telemetry data on idle-daily & shutdown, so we didn't have to worry about jank.

> If we can't do that, then let's balance out the power cost versus the frequency.

Right. For future work, we need to take into account user activity when scheduling Telemetry data collection for aborted session pings as well as daily pings. Ditto for scheduling ping submission.
Status: NEW → ASSIGNED
Flags: needinfo?(vdjeric) → needinfo?(benjamin)
Comment on attachment 8570541 [details] [diff] [review]
part 1 - Write aborted-session pings to disk

Review of attachment 8570541 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +493,5 @@
>    // The deferred promise resolved when the initialization task completes.
>    _delayedInitTaskDeferred: null,
> +  // The timer used to collect & save subsession values, to prevent data loss due to
> +  // aborted sessions.
> +  _abortedSessionTimerId: null,

We're up to 3 timers now (daily ping, aborted session ping, setupTelemetry timer) and I bet we'll have additional timers in the future. It's going to be tough to co-ordinate all of them, synchronize them with other events (sending pings, user idleness, environment changes, etc) and it's going to be very hard to audit the logic.

What do you think about having a single 30 minute timer driving all regularly-scheduled jobs? The "scheduler" would wake up every 30 minutes and check if we need to update the aborted-session ping, or collect & send a daily ping, or send the backlogged pings, or prune old pings, etc.

This would let us reason about scheduling Telemetry actions in a central place (the callback from the timer).

- For example, if an aborted-session ping is scheduled within 30 minutes of a daily ping, we can just generate a daily ping and then save it as an aborted-session ping as well (coalescing). 
- We can add protections against a job type being starved as a result of frequent ad-hoc rescheduling, since the scheduler will know the last time it completed a periodic task
- I think it would also make it easier to integrate user idleness into scheduling decisions, by having the "scheduler" listen to user idleness notifications and having it execute jobs earlier if it learns the user has become idle
- It would be easy to guarantee timers don't fire after shutdown begins
- We could deal with a computer going to sleep, in one place
- It becomes easier to co-ordinate periodic jobs with the sending of pings and random events like environment changes
- The code becomes more synchronous. Timer fires -> scheduler code runs -> calls e.g. saveDailyPing(). saveDailyPing can still use OS.File and other promise-returning code, but the scheduler won't wait on any Promises. If the Promise fails to resolve, the scheduler will just try again in 30 minutes, or if that's not enough the promise's error handler can inform the scheduler

What do you think? I admit I haven't worked out all the implications of this approach, but I'd really like to avoid piling up timers and Promises and then leaving co-ordination for future work.

We should have done a design document for this stuff :/
Attachment #8570541 - Flags: review?(vdjeric)
Flags: needinfo?(gfritzsche)
Flags: needinfo?(alessio.placitelli)
Attachment #8570543 - Flags: review?(vdjeric)
Attachment #8572727 - Flags: review?(vdjeric)
There are two primary uses for aborted session pings:

* make sure we collect all the normal data even if the browser aborts. Note that aborts aren't just browser crashes, but also happen frequently if you logout/shutdown windows while the browser is open: we don't do a clean shutdown procedure in that case.
* Understand more thoroughly non-clean shutdowns, comparea those against actual recorded crashes, and correlate those with other metrics to see if e.g. some addons cause shutdown problems.
Flags: needinfo?(benjamin)
This WIP patch adds a simple scheduler object and makes aborted-session pings use it.
Attachment #8570541 - Attachment is obsolete: true
Flags: needinfo?(alessio.placitelli)
This patch adds aborted-session test coverage using the scheduler. It's missing TelemetryScheduler tests.
Attachment #8570543 - Attachment is obsolete: true
As discussed during our last meeting, I've updated the scheduler:

- to use timestamps;
- to record last attempts time and log execution errors;
Attachment #8574733 - Attachment is obsolete: true
Updated the tests to work with the new scheduler.
Attachment #8574734 - Attachment is obsolete: true
Attachment #8576630 - Flags: review?(vdjeric)
Comment on attachment 8576630 [details] [diff] [review]
part 1 - Write aborted-session pings to disk - v3 (WIP)

Review of attachment 8576630 [details] [diff] [review]:
-----------------------------------------------------------------

I think we mis-communicated again :(

What I was suggesting is something as follows. Please do let me know if you disagree with this approach :)

let TelemetryScheduler = {
  lastDailyPing: 0,
  lastSessionCheckpoint: 0,
  nextDailyPing: 0,
  nextSessionCheckpoint: 0,

  // for sanity checking
  lastAdhocPing: 0,
  lastTimerFire: 0,

  shuttingDown: false,

  init: () => {
     this.nextDailyPing = Date.now() + DAILY_PING_INTERVAL;
     this.nextSessionCheckpoint = Date.now() + ABORTED_SESSION_INTERVAL;
     < schedule timer >
  },

  _onSchedulerTick: () => {
    if (shutting down) {
      // uninitialize & return immediately
    }

    let now = Date.now();

    if (now - this.lastTimerFire > 1.1 * TIMER_INTERVAL) {
      // log a message about the computer having gone to sleep
    }
    
    this.lastTimerFire = now;
    let isDailyPingDue = (this.nextDailyPing * 0.95 <= now); // check if daily is due, with 5% tolerance
    let isAbortedPingDue = (this.nextSessionCheckpoint *0.95 <= now); // ditto for aborted-session pings
    let combineActions = (isDailyPingDue || isAbortedPingDue) && (Math.abs(this.nextDailyPing - this.nextSessionCheckpoint) < COALESCE_THRESHOLD);

    if (combineActions) {
      < generate a single ping >
      < enqueue it as a pending daily ping >
      this.nextDailyPing = now + DAILY_PING_INTERVAL;
      < save it as an aborted-session ping >
      this.nextSessionCheckpoint = now + ABORTED_SESSION_INTERVAL;
    } else if (isDailyPingDue) {
        ...
    } else if (isAbortedPingDue) {
        ...
    } else {
       // Nothing is due, do some sanity checking by looking at lastSessionCheckpoint, lastDailyPing, and lastAdhocPing
    }

    < reschedule the timer for 30 mins from now >
  },

  reschedulePings: (reason, competingPing = null) => {
    now = Date.now();
    this.lastAdhocPing = now;
    if (reason == "environment-changed") {
      // we just generated an environment-changed ping, update the schedules
      this.nextDailyPingTime = now + DAILY_PING_INTERVAL;
      this.nextSessionCheckpoint = now + ABORTED_SESSION_INTERVAL;
    }

    if (!shuttingDown) {
      < reschedule the timer for 30 mins from now >
    }
  },

  // Some ping operations rely on promises. We want those promises to inform the schedule of their success or failure
  pingSucceeded(pingType): {
    < log the success >
    now = Date.now();
    if (pingType == "daily-ping") {
      this.lastDailyPing = now;
    } else if (pingType == "session-checkpoint") {
      this.lastSessionCheckpoint = now;
    } 
  },

  pingFailed(pingType): {
    < log the failure >
    now = Date.now();
    if (pingType == "daily-ping") {
      this.nextDailyPing = now + RETRY_DELAY;
    } else if (pingType == "session-checkpoint") {
      this.nextSessionCheckpoint = now + RETRY_DELAY;
    }
    
    if (!shuttingDown) {
       < reschedule timer for RETRY_DELAY minutes in the future >
    }
  }

  shutdown: () => {
    < cancel timer >
    this.shuttingDown = true;
  }
};

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +68,5 @@
>  const MESSAGE_TELEMETRY_PAYLOAD = "Telemetry:Payload";
>  const MESSAGE_TELEMETRY_GET_CHILD_PAYLOAD = "Telemetry:GetChildPayload";
>  
> +const ABORTED_SESSION_DIRECTORY = "saved-aborted-sessions";
> +const ABORTED_SESSION_FILE_NAME = "aborted-session-ping";

why not name them with GUIDs like the other pings?
Attachment #8576630 - Flags: review?(vdjeric)
correction: reschedulePings should save an environment-changed ping to disk as an aborted-ping
Blocks: 1140037
Blocks: 1139751
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #17)
> > +const ABORTED_SESSION_DIRECTORY = "saved-aborted-sessions";
> > +const ABORTED_SESSION_FILE_NAME = "aborted-session-ping";

I don't think we need a new directory here, it's just one file and we could just store it in the datareporting directory.
Flags: needinfo?(gfritzsche)
This is missing the sanity checking part when no pings are due.
Attachment #8576630 - Attachment is obsolete: true
Attachment #8578140 - Flags: feedback?(vdjeric)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #17)
> ::: toolkit/components/telemetry/TelemetrySession.jsm
> @@ +68,5 @@
> >  const MESSAGE_TELEMETRY_PAYLOAD = "Telemetry:Payload";
> >  const MESSAGE_TELEMETRY_GET_CHILD_PAYLOAD = "Telemetry:GetChildPayload";
> >  
> > +const ABORTED_SESSION_DIRECTORY = "saved-aborted-sessions";
> > +const ABORTED_SESSION_FILE_NAME = "aborted-session-ping";
> 
> why not name them with GUIDs like the other pings?

Because we only generate the ping GUID when it is actually saved to disk. Also, with each save, the ping GUID (ping.id) changes, so I thought to keep it simple and give this ping a human readable name.

I think that setting a manually generated GUID to ABORTED_SESSION_FILE_NAME would be confusing, as there would be a mismatch between the file name and the ping id.
Comment on attachment 8578140 [details] [diff] [review]
part 1 - Write aborted-session pings to disk - v4 (WIP)

Review of attachment 8578140 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +82,5 @@
>  const TELEMETRY_DELAY = 60000;
>  // Delay before initializing telemetry if we're testing (ms)
>  const TELEMETRY_TEST_DELAY = 100;
> +// Execute a scheduler tick every 5 minutes.
> +const SCHEDULER_TICK_INTERVAL_MS = 5 * 60 * 1000;

- It doesn't make sense to have the tick interval be more fine-grained than the most frequent operation (saving aborted sessions)
- waking up every 5 minutes to overwrite a 100KB file seems a bit wasteful. We should disable the scheduler or greatly increase its period when the user is idle. This can be done in a followup bug, but it needs to block phase 2
- Do we need extended (Telemetry) data in the aborted session ping?

@@ +100,5 @@
> +// The frequency at which we persist session data to the disk to prevent data loss
> +// in case of aborted sessions (currently 30 minutes).
> +const ABORTED_SESSION_UPDATE_INTERVAL_MS = 30 * 60 * 1000;
> +// The delay after which the first aborted session save is scheduled (currently 2 minutes).
> +const ABORTED_SESSION_STARTUP_DELAY_MS = 2 * 60 * 1000;

we're gonna miss startup crashes and crashes in sessions < 2 mins, but that might be acceptable

@@ +173,5 @@
> +  setSchedulerTickTimeout: (callback, delayMs) => setTimeout(callback, delayMs),
> +  clearSchedulerTickTimeout: id => clearTimeout(id),
> +  isDailyPingDue: (dueTime, now) => (dueTime * SCHEDULER_TASK_DUE_TOLERANCE <= now),
> +  isAbortedPingDue: (dueTime, now) => (dueTime * SCHEDULER_TASK_DUE_TOLERANCE <= now),
> +  getSessionCheckpointTime: (now, offset) => now + offset,

I think we can make the code a lot cleaner by adding only clearTimeout and setTimeout to Policy. In tests, those methods would redirect to test code and would not set any real timers. That way the unit test code controls the passage of time via Date.now() and can do the "timer" callback itself when it sees fit.

@@ +396,5 @@
> +    this._shuttingDown = false;
> +    let now = Policy.now();
> +    this._nextDailyPingTime = this._estimateDailyPingDueTime(now);
> +    this._nextSessionCheckpointTime =
> +      Policy.getSessionCheckpointTime(now.getTime(), ABORTED_SESSION_STARTUP_DELAY_MS);

if we're calling a custom method to calculate the next session checkpoint time, that method shouldn't need to be passed constants (ABORTED_SESSION_STARTUP_DELAY_MS)

@@ +415,5 @@
> +    this._schedulerTimer =
> +      Policy.setSchedulerTickTimeout(() => this._onSchedulerTick(), aTimeout);
> +  },
> +
> +  _estimateDailyPingDueTime: function(now) {

add comments to all the method definitions

@@ +416,5 @@
> +      Policy.setSchedulerTickTimeout(() => this._onSchedulerTick(), aTimeout);
> +  },
> +
> +  _estimateDailyPingDueTime: function(now) {
> +    let midnight = truncateToDays(now).getTime() + MS_IN_ONE_DAY;

is the scheduling of the daily ping independent of when other types of pings were sent?

@@ +446,5 @@
> +    if (now - this.lastTimerFire > 1.1 * SCHEDULER_TICK_INTERVAL_MS) {
> +      this._log.trace("_onSchedulerTick - First scheduler tick after sleep.");
> +    }
> +
> +    this._lastTickTime = now;

nit: move line above to earlier (after "let now = ..")

@@ +448,5 @@
> +    }
> +
> +    this._lastTickTime = now;
> +    // Check if aborted-session ping is due.
> +    let isAbortedPingDue = Policy.isAbortedPingDue(this._nextSessionCheckpointTime, now);

this is too obfuscated for a simple less-than comparison

@@ +465,5 @@
> +    if (combineActions) {
> +      this._log.trace("_onSchedulerTick - Combining pings.");
> +      // Send the daily ping and also save its payload as an aborted-session ping.
> +      promise = Impl._sendDailyPing(true).then(() => this.pingSucceeded(REASON_DAILY),
> +                                               () => this.pingFailed(REASON_DAILY));

looks like wouldn't get a success/fail callback for saving the aborted-session ping :(

@@ +513,5 @@
> +      this._nextSessionCheckpoint =
> +        Policy.getSessionCheckpointTime(now.getTime(), ABORTED_SESSION_UPDATE_INTERVAL_MS);
> +    }
> +
> +    if (!this._shuttingDown) {

rescheduleTimeout already has the shutDown check

@@ +537,5 @@
> +  pingFailed: function(reason) {
> +    this._log.error("pingFailed - reason: " + reason);
> +    let now = Policy.now().getTime();
> +    if (reason == REASON_DAILY) {
> +      this._nextDailyPingTime = now + SCHEDULER_TICK_RETRY_DELAY_MS;

- where is SCHEDULER_TICK_RETRY_DELAY_MS defined?
- we should have some limit to the number of retries or an exponential backoff

@@ +543,5 @@
> +      this._nextSessionCheckpointTime = now + SCHEDULER_TICK_RETRY_DELAY_MS;
> +    }
> +
> +    if (!this._shuttingDown) {
> +       this._rescheduleTimeout(SCHEDULER_TICK_RETRY_DELAY_MS);

- rescheduleTimeout already has the shutDown check

@@ +1259,5 @@
>        addClientId: true,
>        addEnvironment: true,
>      };
> +    return TelemetryPing.send(getPingType(payload), payload, options)
> +                        .then(notifyScheduler, notifyScheduler);

i don't think we want to reschedule pings upon a successful send

@@ +1778,5 @@
>  
> +  /**
> +   * Gather and send a daily ping.
> +   * @param {Boolean} [saveAsAborted=false] Also saves the payload as an aborted-session
> +   *                  ping.

describe return type?

@@ +1793,3 @@
>  
> +    // If required, also save the payload as an aborted session.
> +    if (saveAsAborted) {

couln't this be expressed as:

promise = TelemetryPing.send(...);
if (saveAsAborted) {
  promise.then(..);
}
return promise;
Attachment #8578140 - Flags: feedback?(vdjeric) → feedback-
Blocks: 1143796
No longer blocks: 1069869
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #22)
> @@ +100,5 @@
> > +// The frequency at which we persist session data to the disk to prevent data loss
> > +// in case of aborted sessions (currently 30 minutes).
> > +const ABORTED_SESSION_UPDATE_INTERVAL_MS = 30 * 60 * 1000;
> > +// The delay after which the first aborted session save is scheduled (currently 2 minutes).
> > +const ABORTED_SESSION_STARTUP_DELAY_MS = 2 * 60 * 1000;
> 
> we're gonna miss startup crashes and crashes in sessions < 2 mins, but that
> might be acceptable

I think we should write out an initial aborted-session ping (if performance is a concern early with an empty payload).
We want to detect as many early crashes as possibly and we already miss some from the delayed telemetry startup.
Blocks: 1069869
Thanks for your feedback Vladan. This patch takes care of what you suggested.
Attachment #8578140 - Attachment is obsolete: true
Attached patch part 2 - Add test coverage - v4 (obsolete) — Splinter Review
Attachment #8576631 - Attachment is obsolete: true
Attachment #8578736 - Flags: feedback?(vdjeric)
Attachment #8578736 - Flags: feedback?(gfritzsche)
Comment on attachment 8578736 [details] [diff] [review]
part 1 - Write aborted-session pings to disk - v4

Review of attachment 8578736 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +482,5 @@
> +    }
> +    this._lastTickTime = now;
> +
> +    // Check if aborted-session ping is due.
> +    let isAbortedPingDue = this._nextSessionCheckpointTime * SCHEDULER_TASK_DUE_TOLERANCE <= now;

I think this check needs to be changed, because it makes a task due even if tasks are more than 60 minutes away (even with a tolerance of 0.999).

We could make it far less tolerant, or could be changed to:

  let isAbortedPingDue= (this._nextSessionCheckpointTime <= now) ||
      (Math.abs(now - this._nextSessionCheckpointTime) <= TOLERANCE_MS);
Attachment #8578740 - Flags: review?(vdjeric)
Attachment #8572727 - Flags: review?(vdjeric)
Attachment #8578736 - Flags: feedback?(vdjeric) → review?(vdjeric)
I took the liberty to change the tolerance check.
Attachment #8578736 - Attachment is obsolete: true
Attachment #8578736 - Flags: review?(vdjeric)
Attachment #8578736 - Flags: feedback?(gfritzsche)
Attachment #8579299 - Flags: review?(vdjeric)
(In reply to Georg Fritzsche [:gfritzsche] [away Feb 27 - March 15] from comment #23)
> I think we should write out an initial aborted-session ping (if performance
> is a concern early with an empty payload).
> We want to detect as many early crashes as possibly and we already miss some
> from the delayed telemetry startup.

I don't think we need to write out an initial aborted-session ping as soon as Firefox starts. The purpose of aborted pings is to 1) collect all the normal data even if the browser aborts and 2) to understand more thoroughly non-clean shutdowns (comment 12). In the meeting this week, Benjamin clarified #1 explaining that he's interested in search count measurements etc. So I don't think startup FHR/Telemetry measurements are valuable.
Also we won't be using these pings to diagnose startup crashes.
(In reply to Alessio Placitelli [:Dexter] from comment #26)
> I think this check needs to be changed, because it makes a task due even if
> tasks are more than 60 minutes away (even with a tolerance of 0.999).
> 
> We could make it far less tolerant, or could be changed to:
> 
>   let isAbortedPingDue= (this._nextSessionCheckpointTime <= now) ||
>       (Math.abs(now - this._nextSessionCheckpointTime) <= TOLERANCE_MS);

Agree
Blocks: 1144972
Comment on attachment 8579299 [details] [diff] [review]
part 1 - Write aborted-session pings to disk - v5

Review of attachment 8579299 [details] [diff] [review]:
-----------------------------------------------------------------

My major comments from the latest review pass are below.
More generally, this patch has taken a while to get reviewed, and every time I do a review pass, I find more stuff I think needs to be changed. Let me do another couple of passes throughout the day tomorrow before changing this patch again.

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +67,5 @@
>  
>  const MESSAGE_TELEMETRY_PAYLOAD = "Telemetry:Payload";
>  const MESSAGE_TELEMETRY_GET_CHILD_PAYLOAD = "Telemetry:GetChildPayload";
>  
> +const ABORTED_SESSION_DIRECTORY = "datareporting";

make this into a more general name, and replace the other hard-coded "datareporting" strings in this file

@@ +90,5 @@
> +// if a scheduled operation fails.
> +const SCHEDULER_RETRY_ATTEMPTS = 3;
> +
> +// The tolerance we have when checking if a task is due in the scheduler (5 minutes).
> +const SCHEDULER_DUE_TOLERANCE_MS = 5 * 60 * 1000;

- the tolerance should be less than the tick interval, definitely less than SCHEDULER_TICK_INTERVAL_MS / 2
- do we need a 5 minute tick if our least frequent action is 30 mins?
- if we're ticking every 5 minutes, we don't need a tolerance anymore . It didn't make sense to wait for 30 minutes for the next tick to execute a task due in 1 minute, but with a 5 minute tick period, that's not a problem anymore.

@@ +107,5 @@
> +const ABORTED_SESSION_UPDATE_INTERVAL_MS = 30 * 60 * 1000;
> +
> +// The job used to collect & save subsession values, to prevent data loss due to
> +// aborted sessions.
> +const ABORTED_SESSION_JOB_ID = "aborted-session-job";

this const isn't used anymore

@@ +405,5 @@
> +
> +  _log: null,
> +
> +  // The number of times a failed operation was rescheduled.
> +  _retryAttempts: 0,

Do we need to distinguish between retrying sends & retrying saving?

@@ +496,5 @@
> +
> +    // Check if aborted-session ping is due.
> +    let isAbortedPingDue = (this._nextSessionCheckpointTime <= now) ||
> +                           areTimesClose(this._nextSessionCheckpointTime, now, SCHEDULER_DUE_TOLERANCE_MS);
> +#if !defined(MOZ_WIDGET_GONK) && !defined(MOZ_WIDGET_ANDROID)

I think this #if should cover isAbortedPingDue as well. B2G & Fennec only do "classic" Telemetry pings.
In fact, I think we can just return early at the top of this function for B2G & Fennec

@@ +501,5 @@
> +    // Check if daily ping is due.
> +    let isDailyPingDue = (this._nextDailyPingTime <= now) ||
> +                         areTimesClose(this._nextDailyPingTime, now, SCHEDULER_DUE_TOLERANCE_MS);
> +    // We can combine the daily-ping and the aborted-session ping in the following cases:
> +    // - If both the daily and the aborted session pings are due (a laptop that wakes

I think we should explicitly indicate in the daily ping when a computer has woken up after having crossed a midnight boundary (bug 1144972)
We should also explicitly report how much of the session's uptime was spent sleeping (bug 1144974).
These bugs are not urgent at all

@@ +624,5 @@
> +      this._nextSessionCheckpointTime = now + delay;
> +    } else if (reason == REASON_DAILY) {
> +      this._nextDailyPingTime = now + delay;
> +    } else if (reason == REASON_ABORTED_SESSION) {
> +      this._nextSessionCheckpointTime = now + delay;

If we reached the limit of 3 retries, wouldn't this just set _nextSessionCheckpointTime to now + 5 minutes? In case of persistent failure, this would cause inifinite retries (every tick)!

This is the kind of bug I was hoping to avoid with this TelemetryScheduler approach. I think we can make TelemetryScheduler more robust by making better use of the _last* variables. Right now, we just write to them and never read them.

On each tick, we should decide whether anything needs to happen by looking at 1) the current time, and 2) when we last created each of the periodic pings.

So a daily ping is due if: (now ~ midnight && (now - _lastDailyPingTime >= minimum subping length))
An aborted ping is due if: (now - _lastSessionCheckpointTime >= 30 mins).
_onSchedulerTick increments the try count on every attempt.
When a ping promise succeeds, it updates the _last* timestamp to prevent retries and resets the try count to 0.
When a ping promise fails, it does nothing. Retries happen automatically on every tick while the above conditions are true.
Retries stop when the try count for an operation reaches 3. At that point we set _last*Time = current time. As a result, we woouldn't re-attempt dailies until next midnight after the last failed try, and we wouldn't re-attempt aborted pings for 30 minutes after the last failed try.
We could also keep a lastSuccessful*Ping but we don't really need it for anything (just like the _last* vars aren't used for anything in the current patch).

This would allow us to drop the _next* vars, which are proving tricky to maintain. I originally suggested the _next* approach because I thought we needed the freedom to reschedule dailies & aborted-session pings to arbitrary times, in order to accommodate unplanned computer sleeps & environment-changed pings, but sleeps are not an issue in this approach & the scheduler can also keep a _lastEnvironmentChangedTime var if needed.

What do you think?
Attachment #8579299 - Flags: review?(vdjeric)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #30)
> Comment on attachment 8579299 [details] [diff] [review]
> part 1 - Write aborted-session pings to disk - v5
> 
> Review of attachment 8579299 [details] [diff] [review]:
> -----------------------------------------------------------------
> @@ +90,5 @@
> > +// if a scheduled operation fails.
> > +const SCHEDULER_RETRY_ATTEMPTS = 3;
> > +
> > +// The tolerance we have when checking if a task is due in the scheduler (5 minutes).
> > +const SCHEDULER_DUE_TOLERANCE_MS = 5 * 60 * 1000;
> 
> - the tolerance should be less than the tick interval, definitely less than
> SCHEDULER_TICK_INTERVAL_MS / 2
> - do we need a 5 minute tick if our least frequent action is 30 mins?
> - if we're ticking every 5 minutes, we don't need a tolerance anymore . It
> didn't make sense to wait for 30 minutes for the next tick to execute a task
> due in 1 minute, but with a 5 minute tick period, that's not a problem
> anymore.

That said, it might make more sense to just tick every 30 minutes and keep the tolerances.

> @@ +624,5 @@
> > +      this._nextSessionCheckpointTime = now + delay;
> > +    } else if (reason == REASON_DAILY) {
> > +      this._nextDailyPingTime = now + delay;
> > +    } else if (reason == REASON_ABORTED_SESSION) {
> > +      this._nextSessionCheckpointTime = now + delay;
> 
> If we reached the limit of 3 retries, wouldn't this just set
> _nextSessionCheckpointTime to now + 5 minutes? In case of persistent
> failure, this would cause inifinite retries (every tick)!
> 
> This is the kind of bug I was hoping to avoid with this TelemetryScheduler
> approach. I think we can make TelemetryScheduler more robust by making
> better use of the _last* variables. Right now, we just write to them and
> never read them.
> 
> On each tick, we should decide whether anything needs to happen by looking
> at 1) the current time, and 2) when we last created each of the periodic
> pings.
> 
> So a daily ping is due if: (now ~ midnight && (now - _lastDailyPingTime >=
> minimum subping length))
> An aborted ping is due if: (now - _lastSessionCheckpointTime >= 30 mins).
> _onSchedulerTick increments the try count on every attempt.
> When a ping promise succeeds, it updates the _last* timestamp to prevent
> retries and resets the try count to 0.
> When a ping promise fails, it does nothing. Retries happen automatically on
> every tick while the above conditions are true.

This sounds like a good idea since we don't need more flexibility!

> [...]
> This would allow us to drop the _next* vars, which are proving tricky to
> maintain. I originally suggested the _next* approach because I thought we
> needed the freedom to reschedule dailies & aborted-session pings to
> arbitrary times, in order to accommodate unplanned computer sleeps &
> environment-changed pings, but sleeps are not an issue in this approach &
> the scheduler can also keep a _lastEnvironmentChangedTime var if needed.
> 
> What do you think?

This greatly simplifies the scheduler. Since there's no particular reason to allow rescheduling of pings at arbitrary time, that's ok: I can't think of any reason not to do it this way.
Comment on attachment 8572727 [details] [diff] [review]
part 3 - Update the docs.

Review of attachment 8572727 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/docs/main-ping.rst
@@ +6,5 @@
>  It includes the histograms and other performance and diagnostic data.
>  
>  This ping is triggered by different scenarios, which is documented by the ``reason`` field:
>  
> +* ``aborted-session`` - triggered if a previous aborted session is found at startup

Lets expand on the mechanics a little, i.e. that sessions are persisted to disk in regular intervals for this.
Attachment #8572727 - Flags: review?(vdjeric)
(In reply to Georg Fritzsche [:gfritzsche] [away Feb 27 - March 15] from comment #32)
> Comment on attachment 8572727 [details] [diff] [review]
> part 3 - Update the docs.
> 
> Review of attachment 8572727 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/telemetry/docs/main-ping.rst
> @@ +6,5 @@
> >  It includes the histograms and other performance and diagnostic data.
> >  
> >  This ping is triggered by different scenarios, which is documented by the ``reason`` field:
> >  
> > +* ``aborted-session`` - triggered if a previous aborted session is found at startup
> 
> Lets expand on the mechanics a little, i.e. that sessions are persisted to
> disk in regular intervals for this.

Actually - can we add a high-level overview of the timer mechanics too?
That would help other people understand when pings get generated without diving into the code (e.g. server-side engineers, metrics, QA).
Comment on attachment 8578740 [details] [diff] [review]
part 2 - Add test coverage - v4

Review of attachment 8578740 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/tests/unit/head.js
@@ +77,5 @@
>    registrar.registerFactory(XULAPPINFO_CID, "XULAppInfo",
>                              XULAPPINFO_CONTRACTID, XULAppInfoFactory);
>  }
>  
> +// Fake a function in the Policy object from TelemetrySession.

"Fake the timeout functions for the scheduler ..."

::: toolkit/components/telemetry/tests/unit/test_TelemetrySession.js
@@ +72,5 @@
>  const HAS_DATAREPORTINGSERVICE = "@mozilla.org/datareporting/service;1" in Cc;
>  const SESSION_RECORDER_EXPECTED = HAS_DATAREPORTINGSERVICE &&
>                                    Preferences.get(PREF_FHR_SERVICE_ENABLED, true);
>  
> +const ABORTED_PING_DIRECTORY = "datareporting";

Lets call it the DATAREPORTING_DIR.

@@ +1157,5 @@
> +
> +  Assert.ok((yield OS.File.exists(ABORTED_DIR)),
> +            "Telemetry must create the aborted session directory when starting.");
> +
> +

Nit: redundant empty line.

@@ +1161,5 @@
> +
> +  // Fake now again so that the scheduled aborted-session save takes place.
> +  now = futureDate(now, ABORTED_SESSION_UPDATE_INTERVAL_MS);
> +  fakeNow(now);
> +    // The first aborted session checkpoint must take place right after the initialisation.

Nit: indentation.

@@ +1169,5 @@
> +  // Check that the aborted session is due at the correct time.
> +  Assert.ok((yield OS.File.exists(ABORTED_FILE)),
> +            "There must be an aborted session ping.");
> +
> +  // The ping is not yet into the saved-pings folder, so we can't access it using

Nit: is not yet *saved*?

@@ +1195,5 @@
> +  // the aborted-session ping to the pending pings, we must verify that it exists.
> +  const PENDING_PING_FILE =
> +    OS.Path.join(TelemetryFile.pingDirectoryPath, abortedSessionPing.id);
> +  Assert.ok((yield OS.File.exists(PENDING_PING_FILE)),
> +            "The aborted session ping must exist in the saved pings directory.");

I don't think we should check the files directly.
We should just trigger ping submission and check that the correct aborted-session got submitted (checking type, reason and id).

@@ +1214,5 @@
> +  gRequestIterator = Iterator(new Request());
> +
> +  // Schedule a daily gathering 2 minutes from now, which overlaps with the aborted session
> +  // first save.
> +  let nowDate = new Date(2009, 10, 18, 00, 00, 0);

Why would that trigger a daily gathering in 2 minutes?
This is setting now to 12AM, not 11:58PM?

@@ +1225,5 @@
> +  Assert.ok((yield OS.File.exists(ABORTED_DIR)),
> +            "Telemetry must create the aborted session directory when starting.");
> +
> +  // Force scheduling the aborted session around midnight so it gets merged with the
> +  // daily ping.

I think this comment should be improved:
You are not forcing the schedule of aborted to midnight, but rather delaying/skipping the callback for aborted to make them fall together.

@@ +1240,5 @@
> +  Assert.equal(dailyPing.payload.info.reason, REASON_DAILY);
> +
> +  // Check that an aborted session ping was also written to disk.
> +  Assert.ok((yield OS.File.exists(ABORTED_FILE)),
> +            "There must be an aborted session ping.");

We may want to assert at least session & subsession ids being equal between the daily and aborted-session pings?

@@ +1244,5 @@
> +            "There must be an aborted session ping.");
> +
> +  yield TelemetrySession.shutdown();
> +});
> +

Can we also get test coverage for more than the first aborted-session?
I.e. trigger the first one & check it, then trigger the second one and check it being different/updated.
Attachment #8578740 - Flags: review?(vdjeric) → feedback+
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #30)
> @@ +405,5 @@
> > +
> > +  _log: null,
> > +
> > +  // The number of times a failed operation was rescheduled.
> > +  _retryAttempts: 0,
> 
> Do we need to distinguish between retrying sends & retrying saving?

Thinking of this gain, if we schedule the tick every 30 minutes, even if we fail saving the aborted-session every time, it still would be scheduled on the next tick, which is on 30 minutes. There would be no difference compared to the normal "save" cycle, except for a log print.

> So a daily ping is due if: (now ~ midnight && (now - _lastDailyPingTime >=
> minimum subping length))
> An aborted ping is due if: (now - _lastSessionCheckpointTime >= 30 mins).
> _onSchedulerTick increments the try count on every attempt.
> When a ping promise succeeds, it updates the _last* timestamp to prevent
> retries and resets the try count to 0.
> When a ping promise fails, it does nothing. Retries happen automatically on
> every tick while the above conditions are true.
>
> What do you think?

So retrying doesn't change the tick interval, right?

Also, the conditions over which we can coalesce the two action should slightly change: we combine the pings if both the daily and the aborted pings are due or if the daily ping is due and the aborted ping will follow shortly after.
(In reply to Alessio Placitelli [:Dexter] from comment #31)
> That said, it might make more sense to just tick every 30 minutes and keep
> the tolerances.

Sure, but I think Benjamin was saying we should save aborted-session pings more frequently (on the order of 5 minutes)

> Thinking of this gain, if we schedule the tick every 30 minutes, even if we
> fail saving the aborted-session every time, it still would be scheduled on
> the next tick, which is on 30 minutes. There would be no difference compared
> to the normal "save" cycle, except for a log print.

You're right, we can set _lastSessionCheckpointTime immediately because aborted pigns get saved frequently. We only really want to bother retrying the dailies, but once we separate the daily ping generation and the actual sending of accumulated pings, I'd say we don't need to retry *generating* daily pings at allk. If we get a failure *generating* a ping, the failure is likely not transient. It's more likely a bug in our code. Only the actual sending of pings needs to be retried -- because the server might be busy temporarily or the internet connection is flaky.

> So retrying doesn't change the tick interval, right?

Right

> Also, the conditions over which we can coalesce the two action should
> slightly change: we combine the pings if both the daily and the aborted
> pings are due or if the daily ping is due and the aborted ping will follow
> shortly after.

Right
Comment on attachment 8579299 [details] [diff] [review]
part 1 - Write aborted-session pings to disk - v5

Review of attachment 8579299 [details] [diff] [review]:
-----------------------------------------------------------------

- In _onSchedulerTick, we have to be 100% sure we always call rescheduleTimeout() before returning from the function. Right now, it's not clear at a glance that rescheduleTimeout() always gets called in _onSchedulerTick, and if any code in onSchedulerTick throws an exception, the scheduler will stop ticking for the rest of the browser session! 
The code should look more like _onSchedulerTick: function () { try { ...} finally { rescheduleTimeout(); } }
- what do you think about putting TelemetryScheduler into its own .jsm?
- Confirm how setTimeout behaves after a sleep. e.g. if at 7am, setTimeout() get scheduled for 8am, but the computer is asleep from 7:01am to 9am, confirm it will fire immediately upon waking up. As future work, we might want to wait a few minutes after waking up before saving & sending anything
- Just a thought: when TelemetryScheduler starts taking user idleness into account, and the user goes from active to idle, TelemetryScheduler should simply call setTimeout(next midnight) to schedule the next tick. We'll also have to land improvements to user idleness detection before we begin that work

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +801,5 @@
>    _delayedInitTaskDeferred: null,
> +  // Used to serialize session state writes to disk.
> +  _stateSaveSerializer: new SaveSerializer(),
> +  // Used to serialize aborted session ping writes to disk.
> +  _abortedSessionSerializer: new SaveSerializer(),

do we need serializers because we use a collectPing.then(writePing) pattern and collectPings could race?
Comment on attachment 8578740 [details] [diff] [review]
part 2 - Add test coverage - v4

Review of attachment 8578740 [details] [diff] [review]:
-----------------------------------------------------------------

Can you add checks for the following behavior:

- Make sure we do nothing, when nothing is due. e.g. tick 2 minutes after saving an aborted-session & daily ping, make sure no pings are sent or saved
- Check for correct behavior after ticking for the first time after the computer wakes up from 3 days of sleep
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #37)
> Comment on attachment 8579299 [details] [diff] [review]
> part 1 - Write aborted-session pings to disk - v5
> 
> Review of attachment 8579299 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> - In _onSchedulerTick, we have to be 100% sure we always call
> rescheduleTimeout() before returning from the function. Right now, it's not
> clear at a glance that rescheduleTimeout() always gets called in
> _onSchedulerTick, and if any code in onSchedulerTick throws an exception,
> the scheduler will stop ticking for the rest of the browser session! 
> The code should look more like _onSchedulerTick: function () { try { ...}
> finally { rescheduleTimeout(); } }

Good point!

> - what do you think about putting TelemetryScheduler into its own .jsm?

TelemetryScheduler is strongly coupled to TelemetrySession, so I moved it to TelemetrySession.jsm. If you think that would be better, I have no strong opinion against moving it back to a jsm file.

> - Confirm how setTimeout behaves after a sleep. e.g. if at 7am, setTimeout()
> get scheduled for 8am, but the computer is asleep from 7:01am to 9am,
> confirm it will fire immediately upon waking up. As future work, we might
> want to wait a few minutes after waking up before saving & sending anything

The timer will get called after the PC wakes up: setTimeout comes from Timer.jsm [1] and uses XPCOM's TimerThread [2] which deals with sleep/wake up by recomputing the timeout.

> - Just a thought: when TelemetryScheduler starts taking user idleness into
> account, and the user goes from active to idle, TelemetryScheduler should
> simply call setTimeout(next midnight) to schedule the next tick. We'll also
> have to land improvements to user idleness detection before we begin that
> work

And maybe also check when it wakes up again: otherwise having a short sleep near midnight might compromise saving the aborted-session pings.

> ::: toolkit/components/telemetry/TelemetrySession.jsm
> @@ +801,5 @@
> >    _delayedInitTaskDeferred: null,
> > +  // Used to serialize session state writes to disk.
> > +  _stateSaveSerializer: new SaveSerializer(),
> > +  // Used to serialize aborted session ping writes to disk.
> > +  _abortedSessionSerializer: new SaveSerializer(),
> 
> do we need serializers because we use a collectPing.then(writePing) pattern
> and collectPings could race?

We need serializers as calling OS.File* does not guarantee their order of execution, so we might end up with an older ping/state overwriting the latest ping/state.

[1] - https://hg.mozilla.org/mozilla-central/annotate/cbd0efcd976c/toolkit/modules/Timer.jsm#l27
[2] - https://hg.mozilla.org/mozilla-central/annotate/cbd0efcd976c/xpcom/threads/TimerThread.cpp#l498
Thanks for your suggestions, Vladan. This patch removes all the _next* variables and reshapes the scheduler as we discussed.
Attachment #8579299 - Attachment is obsolete: true
Attachment #8580641 - Flags: review?(vdjeric)
(Removed a trailing whitespace)
Attachment #8580641 - Attachment is obsolete: true
Attachment #8580641 - Flags: review?(vdjeric)
Attachment #8580642 - Flags: review?(vdjeric)
Attached patch part 2 - Add test coverage - v5 (obsolete) — Splinter Review
This patch adjusts the tests following the inputs from both Georg and Vladan.
Attachment #8578740 - Attachment is obsolete: true
Attached patch part 2 - Add test coverage - v5 (obsolete) — Splinter Review
Attachment #8580643 - Attachment is obsolete: true
Attachment #8580644 - Flags: review?(gfritzsche)
Attached patch part 3 - Update the docs v2 (obsolete) — Splinter Review
Attachment #8572727 - Attachment is obsolete: true
Attachment #8580646 - Flags: review?(gfritzsche)
Comment on attachment 8580646 [details] [diff] [review]
part 3 - Update the docs v2

Review of attachment 8580646 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/docs/main-ping.rst
@@ +11,3 @@
>  * ``environment-change`` - the :doc:`environment` changed, so the session measurements got reset and a new subsession starts
>  * ``shutdown`` - triggered when the browser session ends
> +* ``daily`` - a session split triggered in 24h hour intervals at local midnight. If an environment-change ping is generated by the time it should be sent, the daily ping is rescheduled for the next midnight

Nit: ``environment-change``
Attachment #8580646 - Flags: review?(gfritzsche) → review+
Thanks for the review, Georg.
Attachment #8580646 - Attachment is obsolete: true
Attachment #8580683 - Flags: review+
Comment on attachment 8580642 [details] [diff] [review]
part 1 - Write aborted-session pings to disk - v6

Review of attachment 8580642 [details] [diff] [review]:
-----------------------------------------------------------------

- should we even initialize TelemetryScheduler on B2G & Fennec? if not, we only need to #if check around that init call
- as is, we'll keep retrying aborted & daily pings on each tick if the pingFailed call doesn't fire or gets delayed. It might not be a real issue
Comment on attachment 8580644 [details] [diff] [review]
part 2 - Add test coverage - v5

Review of attachment 8580644 [details] [diff] [review]:
-----------------------------------------------------------------

Looking at the doc patch, i think this is missing a test for this behavior described there:
"If an ``environment-change`` ping is generated by the time it should be sent, the daily ping is rescheduled for the next midnight"

Otherwise this looks mostly good to me now.

::: toolkit/components/telemetry/tests/unit/test_TelemetrySession.js
@@ +1063,5 @@
>  
>  add_task(function* test_savedSessionData() {
>    // Create the directory which will contain the data file, if it doesn't already
>    // exist.
> +  const dataDir  = OS.Path.join(OS.Constants.Path.profileDir, DATAREPORTING_DIR);

Could as well make that a global const.

@@ +1168,5 @@
> +  // Check that the aborted session is due at the correct time.
> +  Assert.ok((yield OS.File.exists(ABORTED_FILE)),
> +            "There must be an aborted session ping.");
> +
> +  // The ping is not yet saved, so we can't access it using TelemetryFile.popPendingPings().

It is saved, just not along with the usual pending pings.

@@ +1185,5 @@
> +  pingContent = yield OS.File.read(ABORTED_FILE, { encoding: "utf-8" });
> +  let updatedAbortedSessionPing = JSON.parse(pingContent);
> +  checkPingFormat(updatedAbortedSessionPing, PING_TYPE_MAIN, true, true);
> +  Assert.equal(updatedAbortedSessionPing.payload.info.reason, REASON_ABORTED_SESSION);
> +  Assert.notEqual(abortedSessionPing.creationDate, updatedAbortedSessionPing.creationDate);

Also check at least for different ping.id.

@@ +1211,5 @@
> +
> +  // Trick: make the aborted ping file overdue so that it gets sent immediately when
> +  // resetting TelemetryPing.
> +  const OVERDUE_PING_FILE_AGE = TelemetryFile.OVERDUE_PING_FILE_AGE + 60 * 1000;
> +  yield OS.File.setDates(PENDING_PING_FILE, null, Date.now() - OVERDUE_PING_FILE_AGE);

We shouldn't need that workaround.
We can just fake a date far enough into the future so that this is overdue.

@@ +1217,5 @@
> +
> +  // Wait for the aborted-session ping.
> +  let request = yield gRequestIterator.next();
> +  let receivedPing = decodeRequestPayload(request);
> +  Assert.equal(receivedPing.payload.info.reason, REASON_ABORTED_SESSION);

Lets fake UUIDs for the pings too so we can verify we definitely got the correct ping?
Attachment #8580644 - Flags: review?(gfritzsche) → feedback+
Comment on attachment 8580642 [details] [diff] [review]
part 1 - Write aborted-session pings to disk - v6

Review of attachment 8580642 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +184,5 @@
> + * Takes a date and returns the number of milliseconds from Date(0) until the next
> + * midnight.
> + */
> +function getNextMidnight(date) {
> +  return truncateToDays(date).getTime() + MS_IN_ONE_DAY;

we don't need this function anymore. it would have been incorrect for DST btw :)

@@ +210,5 @@
> +  if (areTimesClose(date.getTime(), lastMidnight, SCHEDULER_MIDNIGHT_TOLERANCE_MS)) {
> +    return true;
> +  }
> +
> +  let nextMidnight = lastMidnight + MS_IN_ONE_DAY;

some days have more or less than 24 hours cause of daylight savings time, do this instead:

nextMidnightDate = new Date(truncateToDays(date));
nextMidnightDate.setDate(nextMidnightDate.getDate() + 1);
nextMidnight = nextMidnightDate.getTime();

@@ +430,5 @@
> +    this._log.trace("init");
> +    this._shuttingDown = false;
> +    let now = Policy.now();
> +    this._lastDailyPingTime = now.getTime();
> +    this._lastSessionCheckpointTime = now.getTime();

add a comment about why you're setting these to now

@@ +445,5 @@
> +      return;
> +    }
> +
> +    if (this._schedulerTimer) {
> +      Policy.clearSchedulerTickTimeout(this._schedulerTimer);

is this for extra safety?

@@ +477,5 @@
> +      this._log.warn("_onSchedulerTick - already shutdown.");
> +      return;
> +    }
> +
> +#if defined(MOZ_WIDGET_GONK) || defined(MOZ_WIDGET_ANDROID)

we can just check for this in TelemetrySession.setupChromeProcess()

@@ +515,5 @@
> +    let isAbortedPingDue =
> +      (now - this._lastSessionCheckpointTime) >= ABORTED_SESSION_UPDATE_INTERVAL_MS;
> +    // Check if daily ping is due.
> +    let isDailyPingDue = isNearMidnight(nowDate) &&
> +                         ((now - this._lastDailyPingTime) >= MIN_SUBSESSION_LENGTH_MS);

isNearMidnight will be true at 23:45, 23:50, 23:55, 0:00, 0:05, 0:10 and 0:15
this would cause us to generate duplicate dailies at 23:45 and 0:05 (or 23:45 and 0:15, etc)

if (near midnight) {
  if (areTimesClose(_lastDailyPingTime, this midnight, MIDNIGHT_TOLERANCE_MS )) {
     // we've already sent a ping for this midnight
     isDailyPingDue = false;
  } else if (subsession length < MIN_SUBSESSION_LENGTH_MS) {
     // generating a daily ping now would create a very short subsession
     isDailyPingDue = false;
  } else {
     isDailyPingDue = true;
  }
} else if (_lastDailyPingTime isn't today nor within MIDNIGHT_TOLERANCE_MS of the *previous* midnight) {
   // computer must have gone to sleep, the daily ping is overdue
   isDailyPingDue = true;
} else {
   isDailyPingDue = false;
}

can you add tests for the conditions above?

@@ +583,5 @@
> +  _pingSucceeded: function(reason=null) {
> +    this._log.trace("pingSucceeded - reason: " + reason);
> +    let now = Policy.now().getTime();
> +    if (!reason) {
> +      this._lastDailyPingTime = now;

There's a chance of a rare bug here where the time from sending off the daily ping to setting _lastDailyPingTime is large because of sleeps.. it's gonna be rare though, let's leave that for later
Comment on attachment 8580642 [details] [diff] [review]
part 1 - Write aborted-session pings to disk - v6

Review of attachment 8580642 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +566,5 @@
> +      // update the schedules.
> +      this._saveAbortedPing(now.getTime(), competingPayload);
> +      this._lastSessionCheckpointTime = now.getTime();
> +      // If we're close to midnight, skip today's daily ping and reschedule it for tomorrow.
> +      if (isNearMidnight(now)) {

you're going to need an isDailyPingDue function.. being near midnight is not enough information, the daily might already have been sent for this midnight or the daily might be overdue

@@ +605,5 @@
> +  _pingFailed: function(reason=null) {
> +    this._log.error("_pingFailed - reason: " + reason);
> +    let now = Policy.now().getTime();
> +    if (!reason || reason == REASON_DAILY) {
> +      this._dailyPingRetryAttempts++;

there's going to be some funkiness with unfortunately timed sleeps.. e.g. we send then computer goes to sleep and pingSuccess comes hours later when computer wakes up, causing us to resend
this is going to be very rare though, so let's ignore it unless it becomes a real problem

@@ +1468,5 @@
>          TelemetryEnvironment.registerChangeListener(ENVIRONMENT_CHANGE_LISTENER,
>                                                      () => this._onEnvironmentChange());
>  #endif
> +        // Check for a previously written aborted session ping.
> +        yield this._checkAbortedSessionPing();

move this down into the #if.. actually you can just merge the two adjacent #ifs

@@ +1480,5 @@
> +        }
> +#endif
> +
> +        // Start the scheduler.
> +        TelemetryScheduler.init();

Move this up into the #if
Attachment #8580642 - Flags: review?(vdjeric)
Attachment #8580642 - Attachment is obsolete: true
Attachment #8580806 - Flags: review?(vdjeric)
Attachment #8580806 - Attachment is obsolete: true
Attachment #8580806 - Flags: review?(vdjeric)
Attachment #8580825 - Flags: review?(vdjeric)
Comment on attachment 8580806 [details] [diff] [review]
part 1 - Write aborted-session pings to disk - v7

Review of attachment 8580806 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +456,5 @@
> +   */
> +  _isDailyPingDue: function(nowDate) {
> +    if (isNearMidnight(nowDate)) {
> +      let subsessionLength = Math.abs(nowDate.getTime() - this._lastDailyPingTime);
> +      if (areTimesClose(this._lastDailyPingTime, nowDate.getTime(), SCHEDULER_MIDNIGHT_TOLERANCE_MS)) {

this check would cause a second daily ping at 0:10 when the last ping was at 23:50. 
this should be a comparison of the nearest midnight & the last ping time, within a tolerance of 15 mins.

can you rename the isNearMidnight function to getNearestMidnight(now) and then adjust the code in this function correspondingly?

@@ +459,5 @@
> +      let subsessionLength = Math.abs(nowDate.getTime() - this._lastDailyPingTime);
> +      if (areTimesClose(this._lastDailyPingTime, nowDate.getTime(), SCHEDULER_MIDNIGHT_TOLERANCE_MS)) {
> +        // We've already sent a ping for this midnight.
> +        return false;
> +      } else if (subsessionLength < MIN_SUBSESSION_LENGTH_MS) {

nit: make this the the first check so it's closer to "let subsessionLength = "

@@ +472,5 @@
> +    let todayDate = truncateToDays(nowDate);
> +    // Check that _lastDailyPingTime isn't today nor within SCHEDULER_MIDNIGHT_TOLERANCE_MS of the
> +    // *previous* midnight.
> +    if ((lastDailyPingDate.getTime() != todayDate.getTime()) ||
> +        areTimesClose(this._lastDailyPingTime, todayDate.getTime(), SCHEDULER_MIDNIGHT_TOLERANCE_MS)) {

this should be an AND, not an OR: the daily ping is overdue if "a ping was not sent during the previous midnight" AND "we didn't already send the overdue ping today"

@@ +536,5 @@
> +    // Check if aborted-session ping is due.
> +    let isAbortedPingDue =
> +      (now - this._lastSessionCheckpointTime) >= ABORTED_SESSION_UPDATE_INTERVAL_MS;
> +    // Check if daily ping is due.
> +    let isDailyPingDue = this._isDailyPingDue(nowDate);

give different names to the variable & function
Attachment #8580806 - Attachment is obsolete: false
This patch adds the new tests requested by Vladan and Georg.
Attachment #8580644 - Attachment is obsolete: true
Attachment #8580877 - Flags: review?(gfritzsche)
Attachment #8580806 - Attachment is obsolete: true
Attachment #8580825 - Attachment is obsolete: true
Attachment #8580825 - Flags: review?(vdjeric)
Attachment #8580884 - Flags: review?(vdjeric)
Comment on attachment 8580884 [details] [diff] [review]
part 1 - Write aborted-session pings to disk - v8

Review of attachment 8580884 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +93,5 @@
> +const SCHEDULER_MIDNIGHT_TOLERANCE_MS = 15 * 60 * 1000;
> +
> +// Coalesce the daily and aborted-session pings if they are both due within
> +// two minutes from each other.
> +const SCHEDULER_COALESCE_THRESHOLD = 2 * 60 * 1000;

add _MS suffix

@@ +205,5 @@
> +  if (areTimesClose(date.getTime(), lastMidnight.getTime(), SCHEDULER_MIDNIGHT_TOLERANCE_MS)) {
> +    return lastMidnight;
> +  }
> +
> +  let nextMidnightDate = new Date(truncateToDays(date));

nit: you could just do new Date(lastMidnight)

@@ +462,5 @@
> +      let subsessionLength = Math.abs(nowDate.getTime() - this._lastDailyPingTime);
> +      if (subsessionLength < MIN_SUBSESSION_LENGTH_MS) {
> +        // Generating a daily ping now would create a very short subsession.
> +        return false;
> +      } else  if (areTimesClose(this._lastDailyPingTime, nearestMidnight.getTime(),

extra space between "else" and "if"

@@ +466,5 @@
> +      } else  if (areTimesClose(this._lastDailyPingTime, nearestMidnight.getTime(),
> +                  SCHEDULER_MIDNIGHT_TOLERANCE_MS)) {
> +        // We've already sent a ping for this midnight.
> +        return false;
> +      } 

trailing space

@@ +589,5 @@
> +    if (reason == REASON_ENVIRONMENT_CHANGE) {
> +      // We just generated an environment-changed ping, save it as an aborted session and
> +      // update the schedules.
> +      this._saveAbortedPing(now.getTime(), competingPayload);
> +      this._lastSessionCheckpointTime = now.getTime();

so _pingSucceeded() will set the _lastSessionCheckpointTime a second time?
we don't care if an aborted-ping save succeeded or failed. maybe just simplify and set _lastSessionCheckpointTime immediately before calling _saveAbortedPing (anywhere), and not even have _saveAbortedPing call back into _pingSucceeded or _pingFailed.. just have it log a message

@@ +614,5 @@
> +      this._lastSessionCheckpointTime = now;
> +      // We don't have a retry counter for the aborted-session pings, as they get
> +      // saved frequently.
> +      this._dailyPingRetryAttempts = 0;
> +    } else if (reason == REASON_DAILY) {

if you get rid of the callback to _pingSucceeded/Failed for abortedPings, this function becomes a lot simpler.. only one callback reason and you don't even have to pass a reason

@@ +628,5 @@
> +   * @param {Number} now The current time, in milliseconds.
> +   * @param {String} [reason=null] The ping reason. If this is null, it's called for
> +   *                 coalesced pings.
> +   */
> +  _pingFailed: function(now, reason=null) {

ditto for _pingFailed

@@ +1490,5 @@
>  
>          Telemetry.asyncFetchTelemetryData(function () {});
>  
> +        // Check for a previously written aborted session ping.
> +        yield this._checkAbortedSessionPing();

why not put this in the #if?

@@ +1508,3 @@
>  #endif
>  
> +

extra line?

@@ +1966,5 @@
>      this._log.trace("_onEnvironmentChange");
>      let payload = this.getSessionPayload(REASON_ENVIRONMENT_CHANGE, true);
>  
> +    let clonedPayload = Cu.cloneInto(payload, myScope);
> +    TelemetryScheduler.reschedulePings(REASON_ENVIRONMENT_CHANGE, clonedPayload);

oops, is this from Georg's patch?
Attachment #8580884 - Flags: review?(vdjeric) → review+
Thanks Vladan for your patience. This patch fixes the nits and removes the _pingFailed/Success callbcks for the aborted session save.

It also make sure the aborted session functions are not called when shutting down on android and cleans up a remaining old daily timer check in TelemetrySessio.
Attachment #8580884 - Attachment is obsolete: true
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #56)
> @@ +1966,5 @@
> >      this._log.trace("_onEnvironmentChange");
> >      let payload = this.getSessionPayload(REASON_ENVIRONMENT_CHANGE, true);
> >  
> > +    let clonedPayload = Cu.cloneInto(payload, myScope);
> > +    TelemetryScheduler.reschedulePings(REASON_ENVIRONMENT_CHANGE, clonedPayload);
> 
> oops, is this from Georg's patch?

No, that was done so that reschedulePings doesn't need to wait for the environment-change ping to be sent before notifying the scheduler.
Fixed the broken tests on try-push (checks if log is available when shutting down the scheduler).

New try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7dea7fd50b68
Attachment #8581098 - Attachment is obsolete: true
Attachment #8581480 - Flags: review+
Attachment #8580877 - Flags: review?(gfritzsche) → review+
I've pulled and pushed to try again just to double check the mass OSX bustage: https://treeherder.mozilla.org/#/jobs?repo=try&revision=886b50674c0b
No longer blocks: 1143796
Comment on attachment 8581480 [details] [diff] [review]
part 1 - Write aborted-session pings to disk - v10

Review of attachment 8581480 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/TelemetrySession.jsm
@@ +86,5 @@
>  const TELEMETRY_TEST_DELAY = 100;
> +// Execute a scheduler tick every 5 minutes.
> +const SCHEDULER_TICK_INTERVAL_MS = 5 * 60 * 1000;
> +// The maximum number of times a scheduled operation can fail.
> +const SCHEDULER_RETRY_ATTEMPTS = 3;

it's possible (because of sleeps) that we don't get to do the full 3 retries because we're no longer within the midnight tolerance.. i think that's probably ok. A side-effect of that would be that _dailyPingRetryAttempts would not be 0 for the next midnight. Maybe we should reset _dailyPingRetryAttempts on the do-nothing ticks?
This patch takes care of the suggestion by Vladan and fixes a missing negation in |_isDailyPingDue|.
Attachment #8581480 - Attachment is obsolete: true
Attachment #8581721 - Flags: review+
Attachment #8581747 - Flags: review?(gfritzsche)
Comment on attachment 8581747 [details] [diff] [review]
part 4 - Adds test coverage for overdue daily pings.

Review of attachment 8581747 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/telemetry/tests/unit/test_TelemetrySession.js
@@ +1029,5 @@
> +  registerPingHandler((req, res) => {
> +    Assert.ok(false, "No daily ping should be received if not overdue!.");
> +  });
> +
> +  // Run a scheduler tick: it should trigger the daily ping.

Wrong comment.

@@ +1033,5 @@
> +  // Run a scheduler tick: it should trigger the daily ping.
> +  Assert.ok(!!schedulerTickCallback);
> +  yield schedulerTickCallback();
> +
> +  // Restore the non asserting ping handler.

Mention that the Request constructor restores that?
Attachment #8581747 - Flags: review?(gfritzsche) → review+
Thanks Georg.
Attachment #8581747 - Attachment is obsolete: true
Attachment #8581755 - Flags: review+
Marking that as checkin needed, as the try-push is green and the changes since then are very minor.
Keywords: checkin-needed
No longer blocks: 1139751
Blocks: 1143796
You need to log in before you can comment on or make changes to this bug.