Closed Bug 804491 Opened 7 years ago Closed 7 years ago

Policy and scheduling module for Firefox Health Report

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect)

defect
Not set

Tracking

(firefox18 fixed, firefox19 fixed, firefox20 fixed)

RESOLVED FIXED
Firefox 20
Tracking Status
firefox18 --- fixed
firefox19 --- fixed
firefox20 --- fixed

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(1 file, 8 obsolete files)

The Firefox Health Report has a somewhat complex set of requirements around when data upload is allowed to occur. I have codified these requirements in a standalone "policy" file/type.

The attached patch is my first stab at this. It is provided for feedback purposes only (not full review). I'm confident there are bugs - probably glaring bugs. I'm submitting it for feedback so design decisions can be addressed early in the development process.

Open issues:

* Test coverage is horribly inadequate
* There are likely many bugs
* Should probably invent a Promise type for notifications too
* Am I using promises properly?
* Timers not fully implemented

I will submit a patch for review once test coverage is sane and I'm confident bugs and features have been worked out. Until then, happy high-level reviewing!
Attachment #674143 - Flags: feedback?(rnewman)
I figured I'd upload what I have before I go to bed.

Since last:

* MOAR tests (and they all pass now!)
* Small bug fixes

I'd say the tests are still about 1/3 complete.
Assignee: nobody → gps
Attachment #674143 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #674143 - Flags: feedback?(rnewman)
Attachment #674162 - Flags: feedback?(rnewman)
I'm happy with the test coverage and the feature set and am asking for review.

rnewman gets primary review. Host of others get feedback:

Yoric: Promise usage is sane (first time I've used this library)
gavin: Seer
mconnor: Extra eyes
deinspanjer: Extra eyes (FHR perspective)
mreid: Extra eyes (FHR perspective)

I'm pretty happy with the current state of things. I /think/ I captured all the policy requirements in the code.

My open questions are around whether the constants I chose are sane:

* POLLING_INTERVAL is probably a Gecko f?. This will run always, even if data submission is disabled. I /think/ once a minute for something that will read a few prefs isn't too bad.
* SUBMISSION_NOTIFY_BEFORE_INTERVAL is FHR policy f? I made up 12 hours out of the blue.
* IMPLICIT_ACCEPTANCE_INTERVAL is FHR policy f? I too made up the 5 minute number. Perhaps Privacy should sign off on this?
Attachment #674162 - Attachment is obsolete: true
Attachment #674162 - Flags: feedback?(rnewman)
Attachment #674355 - Flags: review?(rnewman)
Attachment #674355 - Flags: feedback?(mreid)
Attachment #674355 - Flags: feedback?(mconnor)
Attachment #674355 - Flags: feedback?(gavin.sharp)
Attachment #674355 - Flags: feedback?(dteller)
Attachment #674355 - Flags: feedback?(deinspanjer)
This is a DUP of bug 707970. See comments there.
Missing feature: intelligent backoff in case of repeated transport failure. Currently, transport failure relies in retry every minute. We could have persistent transport failure in firewalled networks, for example. We should probably retry at most N times in a 24 hour period. Maybe we can have a gradual backoff for the retry attempts. This could be a follow-up bug/part easily enough.
Comment on attachment 674355 [details] [diff] [review]
Policy and scheduling driver for FHR, v3

Review of attachment 674355 [details] [diff] [review]:
-----------------------------------------------------------------

Excellent work; use of promises is good. Here's a first round of review comments.

::: toolkit/healthreport/policy.jsm
@@ +12,5 @@
> +Cu.import("resource://services-common/utils.js");
> +Cu.import("resource://gre/modules/devtools/_Promise.jsm");
> +
> +
> +const MILLISECONDS_IN_DAY = 86400000;

Nit: personal preference is MILLISECONDS_PER_DAY.

@@ +32,5 @@
> +  /**
> +   * No submission was attempted because no data was available.
> +   */
> +  onNoDataAvailable: function noDataAvailable() {
> +    this.noDateAvailable = true;

Shouldn't that be .noDataAvailable?

@@ +70,5 @@
> + *
> + * The rules of data submission are as follows:
> + *
> + *  1. Do not submit data more than once every 24 hours.
> + *  2. Try to submit as close to 24 hours apart as possible.

This statement of policy doesn't seem too useful to me. It lacks both precision and a measure of importance to use when comparing implementations.

Note that "once per day" and "no more than once per 24 hours" is incompatible with imprecise timing and retries… you'll eventually drift such that day N's submission occurs at the end of day N, and then you'll submit nothing on day N+1.

For example, if we use Android's AlarmManager for submission, it might cause a submission to occur more than once per 24 hours, (otherwise you're demanding a kind of clock drift, and retries will cause violations!).

An alternative, more precise, phrasing might be: "Submissions should occur once per day in normal circumstances, no more than once per 24 hour period, and no sooner than 12 hours after the last submission".

This forces you to address the situation of upload failing for so long that the next day's payload is ready.

@@ +71,5 @@
> + * The rules of data submission are as follows:
> + *
> + *  1. Do not submit data more than once every 24 hours.
> + *  2. Try to submit as close to 24 hours apart as possible.
> + *  3. Do not submit too soon after application startup as to not negatively

s/as/so as

@@ +125,5 @@
> + *        events.
> + */
> +function HealthReportPolicy(prefs, listener) {
> +  this._log = Log4Moz.repository.getLogger("HealthReport.Policy");
> +  this._log.level = Log4Moz.Level["Debug"];

TODO: grab level from prefs.

@@ +132,5 @@
> +  this._listener = listener;
> +
> +  // If we've never run before, record the current time.
> +  if (!this.firstRunDate.getTime()) {
> +    this.firstRunDate = this.now();

Good.

@@ +138,5 @@
> +
> +  // Date at which we performed user notification of acceptance.
> +  // This is an instance variable because implicit acceptance should only
> +  // carry forward through a single application instance.
> +  this._userNotifiedDate = null;

OK, so this is when they were notified. When did they accept? Do we care? If not, why do we care when they were notified?

@@ +149,5 @@
> +
> +HealthReportPolicy.prototype = {
> +  // Time before initial scheduled data submission we should notify about
> +  // data submission policy.
> +  SUBMISSION_NOTIFY_BEFORE_INTERVAL: 60000 * 60 * 12, // 12 hours (milliseconds).

You should name these constants with a unit:

  SUBMISSION_NOTIFY_BEFORE_INTERVAL_MSEC

@@ +151,5 @@
> +  // Time before initial scheduled data submission we should notify about
> +  // data submission policy.
> +  SUBMISSION_NOTIFY_BEFORE_INTERVAL: 60000 * 60 * 12, // 12 hours (milliseconds).
> +
> +  // Time that must ellapse with no action on info bar for implicit acceptance

s/ellapse/elapse

@@ +153,5 @@
> +  SUBMISSION_NOTIFY_BEFORE_INTERVAL: 60000 * 60 * 12, // 12 hours (milliseconds).
> +
> +  // Time that must ellapse with no action on info bar for implicit acceptance
> +  // to be inferred.
> +  IMPLICIT_ACCEPTANCE_INTERVAL: 300000, // 5 minutes (milliseconds).

Please be consistent in how you compute these values. 60000 * 5 would be more consistent. 5 * 60 * 1000 would be better still, assuming you change the other constants.

And I'd be tempted to lift this out, so you'd have

  IMPLICIT_ACCEPTANCE_INTERVAL: FIVE_MINUTES_MSEC,

@@ +163,5 @@
> +  // to not negatively impact performance.
> +  //
> +  // The random bit is to ensure that other systems scheduling around the same
> +  // interval don't all get scheduled together.
> +  POLL_INTERVAL: 60000 + Math.floor(2500 * Math.random()),

Gosh, I hope we can trust our PRNG :)

Again, units, and maybe extract the computation into a function?

@@ +170,5 @@
> +  // up and trying again.
> +  //
> +  // We want this to be short enough that we retry frequently enough but long
> +  // enough to give slow networks and systems time to handle it.
> +  SUBMISSION_REQUEST_EXPIRE_INTERVAL: 600000,

Units, consistency, human-readable label.

@@ +213,5 @@
> +   * When a reaction to user notification was recorded.
> +   *
> +   * If there was implicit acceptance, this will be set to the time of that.
> +   */
> +  get userNotifiedReactionDate() {

Response, not reaction.

@@ +228,5 @@
> +   *
> +   * This is used for logging and diagnostics purposes. It can answer the
> +   * question "how was data submission agreed to on this profile."
> +   *
> +   * Not all values are defined by this type and can come from other systems.

Define a type, at least -- you're limited to what prefs can store, so you don't want a caller to pass in a JS object, or a DOM handle, or…

@@ +271,5 @@
> +
> +  /**
> +   * The state of user notification of the data policy.
> +   *
> +   * This must be COMPLETE before data submission can occur.

s/COMPLETE/STATE_NOTIFY_COMPLETE. Ideally qualify with the type name.

@@ +293,5 @@
> +  /**
> +   * When the last data submission was scheduled to occur.
> +   *
> +   * This is when the last data submission should have occurred, not necessary
> +   * when * it actually occurred. This is used to drive scheduling of recurring

Mis-joined lines!

@@ +337,5 @@
> +    return new Date(lastDate.getTime() + MILLISECONDS_IN_DAY);
> +  },
> +
> +  _getDatePref: function _getDatePref(pref, def) {
> +    return new Date(parseInt(this._prefs.get(pref, def), 10));

If the pref doesn't parse, you'll get back an "Invalid Date" object.

--
[18:00:22.759] new Date(parseInt("asdf", 10))
[18:00:22.762] Invalid Date

Then you'll get NaN for .getTime() on that object. Not nice. You'll want to validate the output of parseInt, probably even to range-check it for sanity.

@@ +421,5 @@
> +    return new Date();
> +  },
> +
> +  /**
> +   * Check state and trigger an actions, if necessary.

Plural agreement.

@@ +431,5 @@
> +   * Typically this function is called automatically by the background polling.
> +   * But, it can safely be called manually as needed.
> +   */
> +  checkStateAndTrigger: function checkStateAndTrigger() {
> +    this._log.debug("Checking to see if there is work to be done...");

This is redundant -- the negative check case already logs, so move this afterwards and make the log message success-specific.

@@ +434,5 @@
> +  checkStateAndTrigger: function checkStateAndTrigger() {
> +    this._log.debug("Checking to see if there is work to be done...");
> +
> +    // If submission is not enabled, we have nothing to notify, not even an info
> +    // bar (why show an info bar if no user action would change anything).

Question mark.

@@ +448,5 @@
> +    // Submission is enabled, so we need to ensure user has been notified that
> +    // submission can occur.
> +    if (notifyState == this.STATE_NOTIFY_UNNOTIFIED) {
> +      // We don't notify until a pre-configured time before the first
> +      // submission because that's how things are.

Pull each of these cases into separate methods. That'll give you a lot more clarity about returning and the structure of the state dispatch, allow you to test each individual branch, and make it more feasible for you to use a switch here instead of if().

@@ +459,5 @@
> +
> +      let promise = new Promise();
> +      promise.onUserAccept = function onUserAccept(reason) {
> +        this.recordUserAcceptance(reason);
> +      }.bind(this);

promise.onUserAccept = this.recordUserAcceptance.bind(this);

?

@@ +471,5 @@
> +        this._userNotifiedDate = now;
> +        this.userNotifiedDate = now;
> +      }.bind(this), function onReject() {
> +        this._log.info("Data submission notification presentation failed.");
> +      }.bind(this));

Can we lay this out a little more clearly?

@@ +486,5 @@
> +    // We're waiting for user action or implicit acceptance after display.
> +    if (notifyState == this.STATE_NOTIFY_WAIT) {
> +      // Check for implicit acceptance.
> +      let implicitAcceptanceDate = new Date(
> +        this._userNotifiedDate.getTime() + this.IMPLICIT_ACCEPTANCE_INTERVAL);

Ugly!

@@ +563,5 @@
> +      this.lastScheduledDataSubmissionDate = promise.submissionDate;
> +      return;
> +    }
> +
> +    if (promise.noDataAvailable) {

Each of these states seems to be mutually exclusive -- either no data was available or there was, and if there was, it failed and we should retry, it failed and we shouldn't, or we succeeded.

I would rather see a promise.state field, perhaps with states:

  "submitted"
    -- look up submission date
  "try-again"
  "skip"
 
That allows you to extend to additional kinds of logic -- hard failure, never try again, try again much later, client exception, try again after the next Firefox update -- without a proliferation of fields.

@@ +578,5 @@
> +      }
> +
> +      this._log.warn("Skipping this day's submission.");
> +      this.lastScheduledDataSubmissionDate = this.now();
> +    }

… which brings us to the lack of failure case here. What if the promise comes back with none of those boolean fields set?

If you were switching on a state, this would be apparent.
Attachment #674355 - Flags: review?(rnewman)
(In reply to Richard Newman [:rnewman] from comment #5)
> @@ +70,5 @@
> > + *
> > + * The rules of data submission are as follows:
> > + *
> > + *  1. Do not submit data more than once every 24 hours.
> > + *  2. Try to submit as close to 24 hours apart as possible.
> 
> This statement of policy doesn't seem too useful to me. It lacks both
> precision and a measure of importance to use when comparing implementations.
> 
> Note that "once per day" and "no more than once per 24 hours" is
> incompatible with imprecise timing and retries… you'll eventually drift such
> that day N's submission occurs at the end of day N, and then you'll submit
> nothing on day N+1.
> 
> For example, if we use Android's AlarmManager for submission, it might cause
> a submission to occur more than once per 24 hours, (otherwise you're
> demanding a kind of clock drift, and retries will cause violations!).
> 
> An alternative, more precise, phrasing might be: "Submissions should occur
> once per day in normal circumstances, no more than once per 24 hour period,
> and no sooner than 12 hours after the last submission".
> 
> This forces you to address the situation of upload failing for so long that
> the next day's payload is ready.

Daniel: can you clarify the requirements?
Flags: needinfo?(deinspanjer)
rnewman, I'm a little confused about your reply and wonder if there is a misunderstanding about the desired frequency of submission.  Saying "no more than once per 24 hour period, and no sooner than 12 hours after the last submission" seems to contradict itself.  We are not concerned with the schedule matching up with midnight to midnight boundaries, only that the elapsed number of hours since previous submission is at least 24.

Here is my stab at a list of requirements.  I think they could probably be simplified based on what scheduling features are available in the client, but I went for verbose to try to be clear.

Submission scheduling requirements:
1. Successfully submit data *at most* once per day
2. If the first submission attempt for a day fails, schedule a retry in one hour.
3. If the second submission attempt for a day fails, give up and wait for at least 24 hours.
4. After a successful submission, a timeout can be set for 24 hours to submit again.
5. One minute after startup, the client should evaluate two conditions and submit if both true.
5a. Has it been more than 24 hours since the previous successful submission?
5b. Has it been more than 24 hours since the most recent second submission failure?

These requirements seem to indicate tracking three variables:
1. Last successful submission timestamp
2. Last failed submission timestamp
3. Whether the last failure was a retry
Flags: needinfo?(deinspanjer)
I think we should retry sooner than an hour, especially in the case of transport failures, which could be the result of a shoddy cell network, etc. And, I think we should try more than 1 retry. For the retry schedule, how about:

* 1 minute
* 5 minutes
* 15 minutes
* 1 hour
* +24h from wherever we are now.

A consequence of this is that submissions will tend to be slightly more than 24h apart. But, I think I'm OK with that. If we ever push a hotfix that encourages people to upgrade, I wouldn't mind some skew in the uploading times.
Comment on attachment 674355 [details] [diff] [review]
Policy and scheduling driver for FHR, v3

Review of attachment 674355 [details] [diff] [review]:
-----------------------------------------------------------------

As you asked, I have only looked at the promises. I would like to suggest a few changes on that side.

::: toolkit/healthreport/policy.jsm
@@ +9,5 @@
> +const {classes: Cc, interfaces: Ci, utils: Cu} = Components;
> +
> +Cu.import("resource://services-common/log4moz.js");
> +Cu.import("resource://services-common/utils.js");
> +Cu.import("resource://gre/modules/devtools/_Promise.jsm");

This module is deprecated. You should rather use resource://gre/modules/commonjs/promise/core.js.
As this does not work exactly as _Promise.jsm, this will require a few additional changes.

@@ +23,5 @@
> + * of the functions on this type, not Promise's own methods.
> + */
> +function DataSubmissionPromise() {
> +  Promise.call(this);
> +}

With "standard" promises (including promise/core.js), one cannot inherit from a promise, or extend a single promise. The rationale is that extending a single promise is semantically unsound, as this cannot be propagated through |then|. Also, restricting the use of promises to trivial promises lets us do nice stuff such as Task.js (see e.g. bug 804184 and diff https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=800697&attachment=674286).

Here, if I understand your code, you will probably rather want to store the promise either as a field or hidden somewhere in a closure.

@@ +93,5 @@
> + *     user to be notified that data submission will occur. The function
> + *     receives a Promise. It should be resolved when the user has been
> + *     notified of data submission or rejected if display of user notification
> + *     could not be completed. The value passed to .resolve() and .reject() is
> + *     ignored.

While this is quite acceptable, it would probably be more idiomatic if both methods returned a promise instead of accepting one.

@@ +462,5 @@
> +        this.recordUserAcceptance(reason);
> +      }.bind(this);
> +      promise.onUserReject = function onUserReject(reason) {
> +        this.recordUserRejection(reason);
> +      }.bind(this);

As mentioned above, extending a promise is not very idiomatic.

@@ +545,5 @@
> +      return;
> +    }
> +
> +    promise.then(
> +      this._handleSubmissionResult.bind(this, promise),

It would be more idiomatic to have |promise| actually resolve to something useful and |_handleSubmissionResult| use that result.
Also, nit: I personally prefer avoiding this kind of currying in JavaScript. YMMV.
Attachment #674355 - Flags: feedback?(dteller)
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #7)
> rnewman, I'm a little confused about your reply and wonder if there is a
> misunderstanding about the desired frequency of submission.

That's why I asked for clarification :D


> Saying "no more
> than once per 24 hour period, and no sooner than 12 hours after the last
> submission" seems to contradict itself.

It defines acceptable rolling windows. For example, this would allow a submission at 8pm on a Monday (Monday's data) and 10am on Tuesday (Tuesday's submission). No more than once per 24-hour period and no sooner than 12 hours means that for each checkmark on a timeline, there is at least 12 hours between any two checkmarks, and each checkmark is within a 24-hour-wide block:

Monday                  Tuesday
------------------------|------------------------
                    |              |
                     ------------
                       \ 12 hr /


> We are not concerned with the
> schedule matching up with midnight to midnight boundaries, only that the
> elapsed number of hours since previous submission is at least 24.

OK, that's fine.

Though do be aware of a consequence of this: most submissions will be significantly less frequent than every 24 hours, particularly on mobile devices, if the strict minimum separation is 24 hours. You'll get drift such that a client submits at 10am on Monday, noon on Tuesday, 5pm on Wednesday, not at all on Thursday, 10am Friday, then 9am the following Monday. Or, worse, you'll get submissions in active windows that coincide with 24-hour boundaries: 10am Monday, ~10am Friday, 10am Monday. This is what Android will do, most likely.

If that's not acceptable, the policy needs to explicitly state that a client should not just submit no fewer than 24 hours after the last submission, but as close to 24 hours as network connectivity and timer resolution allows, regardless of battery life.

Getting precision and clarity on this is why I left that comment -- that written policy is how we'll decide how to implement submission on assorted platforms.


> Submission scheduling requirements:
> 1. Successfully submit data *at most* once per day

Can we eschew the word "day", please? "Once per day" to me means "5pm Monday then 10am Tuesday is fine".

IMO the minimum constraint that you want is expressed by "no sooner than 24 hours after the last submission", so we just need the opposing constraint of target frequency -- "as soon as possible after the 24 hour waiting period has elapsed"? -- then some rule to trade these against other constraints.

> 5. One minute after startup, the client should evaluate two conditions and
> submit if both true.

Bear in mind that our eventual goal for this feature is to not have it care about application startup -- it's a bad time to do background work, and it doesn't make sense for B2G.
> A consequence of this is that submissions will tend to be slightly more than
> 24h apart.

Even on desktop, and even without retries.

If my machine submits in the morning one day, it's almost guaranteed that its next upload will be > 28 hours later, because I tend to do everything on my phone in the mornings.

If it submits at 11pm, and the next day I put my machine to sleep at 10:55, it'll be 36 hours until it next submits.

If a user's machine uploads at 5pm on Thursday, it might be Monday before we hear from them again, because they leave the office early on Friday!

With retries it pushes things out still further: if you're on Caltrain when it tries to submit, all of its attempts will fail for an hour or more, and you've skipped that day.
(In reply to Gregory Szorc [:gps] from comment #8)
> I think we should retry sooner than an hour, especially in the case of
> transport failures, which could be the result of a shoddy cell network, etc.
> And, I think we should try more than 1 retry. For the retry schedule, how
> about:
> 
> * 1 minute
> * 5 minutes
> * 15 minutes
> * 1 hour
> * +24h from wherever we are now.
> 
> A consequence of this is that submissions will tend to be slightly more than
> 24h apart. But, I think I'm OK with that. If we ever push a hotfix that
> encourages people to upgrade, I wouldn't mind some skew in the uploading
> times.

I won't argue against multiple retries on their merit alone, but I just want to make sure we aren't adding too much complexity for not enough gain.  Since the client payload is cumulative, we don't actually lose any data unless there is some situation where the client can never submit one minute after startup and one hour after a failed submit.

If that situation would go away by actually trying again in 5 minutes, then it would probably be worth putting the logic in.

Even if we ended up having most clients submitting data every 25 hours would not be a problem for analysis since the data within the payload is consistent regardless of the submission schedule.



(In reply to Richard Newman [:rnewman] from comment #10)
> (In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #7)
> > We are not concerned with the
> > schedule matching up with midnight to midnight boundaries, only that the
> > elapsed number of hours since previous submission is at least 24.
> 
> OK, that's fine.
> 
> Though do be aware of a consequence of this: most submissions will be
> significantly less frequent than every 24 hours, particularly on mobile
> devices, if the strict minimum separation is 24 hours. You'll get drift such
> that a client submits at 10am on Monday, noon on Tuesday, 5pm on Wednesday,
> not at all on Thursday, 10am Friday, then 9am the following Monday. Or,
> worse, you'll get submissions in active windows that coincide with 24-hour
> boundaries: 10am Monday, ~10am Friday, 10am Monday. This is what Android
> will do, most likely.

Metrics is okay with this expected drift.  It is exactly what we see with the current checks to blocklist.  There is a sort of double hump, a slightly smaller one during peak business hours in Europe, and another that ramps up from business hours Eastern USA and peaks at business hours Pacific USA.

> 
> If that's not acceptable, the policy needs to explicitly state that a client
> should not just submit no fewer than 24 hours after the last submission, but
> as close to 24 hours as network connectivity and timer resolution allows,
> regardless of battery life.
> 
> Getting precision and clarity on this is why I left that comment -- that
> written policy is how we'll decide how to implement submission on assorted
> platforms.
> 
> 
> > Submission scheduling requirements:
> > 1. Successfully submit data *at most* once per day
> 
> Can we eschew the word "day", please? "Once per day" to me means "5pm Monday
> then 10am Tuesday is fine".
> 
> IMO the minimum constraint that you want is expressed by "no sooner than 24
> hours after the last submission", so we just need the opposing constraint of

Ugh. I absolutely agree, it was a mistake for me to use the word day there, and your rewording is fine.


> target frequency -- "as soon as possible after the 24 hour waiting period
> has elapsed"? -- then some rule to trade these against other constraints.

I would be fine with adding the corollary of as soon as possible after the 24 hour waiting period has elapsed.

> 
> > 5. One minute after startup, the client should evaluate two conditions and
> > submit if both true.
> 
> Bear in mind that our eventual goal for this feature is to not have it care
> about application startup -- it's a bad time to do background work, and it
> doesn't make sense for B2G.

I did consider B2G and the background Android service when I was thinking about startup.  For those, I'd define start up as the boot of the device or the start of the scheduling service itself.  Does that make sense?
Here are some of the things I'm considering with trying to go for a very simple retry schedule:

1. Simplicity of the code -- A more elaborate schedule requires more complex code which requires more development effort and testing to get done right.  Deferring that for a v.Next seems like a reasonable consideration to me.

2. Minimizing potential client impact -- This code should be tailored wherever reasonable to have little to no impact on the day to day operation of the client.  Taking extra time to attempt multiple retries should be weighed against that.

3. Minimizing potential server impact -- If there is some catastrophic situation where the server is unavailable due to load, having clients with a more frequent retry strategy is likely to exacerbate the issue versus a very minimal retry.  Even in the case of retrying on every startup and an hour later could potentially have a group of clients who are very frequently restarting hammering the server.
I'd also like to add one more item to address the fact that we're dealing with client-side clocks which can be arbitrarily wrong:

If the client's "previous submission time" is in the future, treat it the same way as if it were >24 hours ago.  In other words, attempt to submit in that case too.  The payload should still include the future timestamp so we can detect this situation on the server side.
(In reply to Richard Newman [:rnewman] from comment #5)
> Comment on attachment 674355 [details] [diff] [review]
> Policy and scheduling driver for FHR, v3

> 
> @@ +125,5 @@
> > + *        events.
> > + */
> > +function HealthReportPolicy(prefs, listener) {
> > +  this._log = Log4Moz.repository.getLogger("HealthReport.Policy");
> > +  this._log.level = Log4Moz.Level["Debug"];
> 
> TODO: grab level from prefs.

Every time I do this pattern I can't help but think we need a unified API in Gecko that does all this stuff automatically. Perhaps once I finish this I can work on that...

I don't think we need per-logger level granularity just yet. I think a single level/filter before writing (when that is implemented) will be sufficient.


> @@ +138,5 @@
> > +
> > +  // Date at which we performed user notification of acceptance.
> > +  // This is an instance variable because implicit acceptance should only
> > +  // carry forward through a single application instance.
> > +  this._userNotifiedDate = null;
> 
> OK, so this is when they were notified. When did they accept? Do we care? If
> not, why do we care when they were notified?

This internal variable is the important one. It's used to detect implicit acceptance.

The external/pref userNotifiedDate is not needed by the code. It is mainly for forensics purposes. If there are ever any questions as to when a user was prompted or if we are trying to debug issues with purported accidental acceptance, we have a nice paper trail to help with that. Given the important of getting this code right, I want to have as much state recorded as possible.


> @@ +153,5 @@
> > +  SUBMISSION_NOTIFY_BEFORE_INTERVAL: 60000 * 60 * 12, // 12 hours (milliseconds).
> > +
> > +  // Time that must ellapse with no action on info bar for implicit acceptance
> > +  // to be inferred.
> > +  IMPLICIT_ACCEPTANCE_INTERVAL: 300000, // 5 minutes (milliseconds).
> And I'd be tempted to lift this out, so you'd have
> 
>   IMPLICIT_ACCEPTANCE_INTERVAL: FIVE_MINUTES_MSEC,

Why? I don't think that abstraction provides anything useful.

> @@ +163,5 @@
> > +  // to not negatively impact performance.
> > +  //
> > +  // The random bit is to ensure that other systems scheduling around the same
> > +  // interval don't all get scheduled together.
> > +  POLL_INTERVAL: 60000 + Math.floor(2500 * Math.random()),
> 
> Gosh, I hope we can trust our PRNG :)
> 
> Again, units, and maybe extract the computation into a function?

I was kinda hoping you would challenge me on the necessity of this. I keep going back and forth. Presumably the REPEATING_SLACK timer already introduces some "randomness."
(In reply to David Rajchenbach Teller [:Yoric] from comment #9)
> Comment on attachment 674355 [details] [diff] [review]
> Policy and scheduling driver for FHR, v3

Thanks for the review!

> @@ +23,5 @@
> > + * of the functions on this type, not Promise's own methods.
> > + */
> > +function DataSubmissionPromise() {
> > +  Promise.call(this);
> > +}
> 
> With "standard" promises (including promise/core.js), one cannot inherit
> from a promise, or extend a single promise. The rationale is that extending
> a single promise is semantically unsound, as this cannot be propagated
> through |then|. Also, restricting the use of promises to trivial promises
> lets us do nice stuff such as Task.js (see e.g. bug 804184 and diff
> https://bugzilla.mozilla.org/page.cgi?id=splinter.
> html&bug=800697&attachment=674286).

Got it. I was kinda worried about this myself. Fragile base class, etc. I'll definitely change this.

> @@ +93,5 @@
> > + *     user to be notified that data submission will occur. The function
> > + *     receives a Promise. It should be resolved when the user has been
> > + *     notified of data submission or rejected if display of user notification
> > + *     could not be completed. The value passed to .resolve() and .reject() is
> > + *     ignored.
> 
> While this is quite acceptable, it would probably be more idiomatic if both
> methods returned a promise instead of accepting one.

My rationale here was to make an easier-to-consume API. In the current form I provide specific functions which must be called. I concede that the current implementation of these functions does little more than the equivalent of passing a simple value to resolve(). But, this functionality will likely expand in the future and I want all the logic to be captured in one place, away from the callers so we don't have to worry about implementations screwing it up.
(In reply to Gregory Szorc [:gps] from comment #16)

> equivalent of passing a simple value to resolve(). But, this functionality
> will likely expand in the future and I want all the logic to be captured in
> one place, away from the callers so we don't have to worry about
> implementations screwing it up.

To capture context in this bug: this is the first of a set of modules that will have contributing tendrils throughout other parts of the codebase, quite probably extensible.

Isolation of complexity is a good goal.


(In reply to Gregory Szorc [:gps] from comment #15)

> > TODO: grab level from prefs.
> 
> Every time I do this pattern I can't help but think we need a unified API in
> Gecko that does all this stuff automatically. Perhaps once I finish this I
> can work on that...

Agreed.

> 
> I don't think we need per-logger level granularity just yet. I think a
> single level/filter before writing (when that is implemented) will be
> sufficient.

Yup, so long as it comes out of prefs eventually! Hence "TODO".

> >   IMPLICIT_ACCEPTANCE_INTERVAL: FIVE_MINUTES_MSEC,
> 
> Why? I don't think that abstraction provides anything useful.

I don't mind exactly how it ends up, so long as the unit appears in the name and the computation appears in the line (or a readable version of the value). That could be

  IMPLICIT_ACCEPTANCE_INTERVAL: FIVE_MINUTES_MSEC
  IMPLICIT_ACCEPTANCE_INTERVAL_MSEC: 5 * 60 * 1000;       // 5 minutes.

On reflection, I prefer the latter. It matches the approach we take in the Android codebase.


> I was kinda hoping you would challenge me on the necessity of this. I keep
> going back and forth. Presumably the REPEATING_SLACK timer already
> introduces some "randomness."

I didn't spend a lot of time in this review considering the interval/uploading timer. That's for round two!
(In reply to Richard Newman [:rnewman] from comment #17)
> I didn't spend a lot of time in this review considering the
> interval/uploading timer. That's for round two!

I'm not sure you should bother. Given the comments in this bug today around scheduling, there will likely be significant refactorings.

I've also been toying with the idea of moving the timer to the main FHR "service" type (not yet implemented except in my brain). We'll presumably have other timer-based code to perform some data collection. I'd rather have one timer/coorindator class than multiple ones (I think).
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #12)

> Metrics is okay with this expected drift.  It is exactly what we see with
> the current checks to blocklist.  There is a sort of double hump, a slightly
> smaller one during peak business hours in Europe, and another that ramps up
> from business hours Eastern USA and peaks at business hours Pacific USA.

Great, good to know.

> I would be fine with adding the corollary of as soon as possible after the
> 24 hour waiting period has elapsed.

Excellent. I think we're converging on some wording!

> I did consider B2G and the background Android service when I was thinking
> about startup.  For those, I'd define start up as the boot of the device or
> the start of the scheduling service itself.  Does that make sense?

Sort of: yes, we'll decide whether to submit as soon as we wake up and have a data connection. There's still a question of triggering a timer.

Consider two extreme implementations of this.

The first is checking on an AlarmManager inexact repeating daily interval. Every 24+ hours (upper range slightly less than 48 hours), we run. Submissions will thus occur every 24+ hours, but depending on the vagaries of network etc., I'd probably guess that 36-72 hours would be possible for some slice of users. This is the most power-friendly, simplest approach.

The second is running an alarm every minute. We're guaranteed to hit the 24-hour point if the phone is awake; if it's not, we'll run as soon as it wakes up. This is cruel to users' batteries.

Knowing how desirable it is for us to do data upload close to the 24-hour mark will allow us to decide on an implementation strategy -- timer duration, piggybacking on other services, etc.
Incorporated most of initial review feedback from rnewman and Yoric. Still need to work a lot on scheduling per other comments.

Not asking for review on this because it is incomplete.
(In reply to Gregory Szorc [:gps] from comment #16)
> > @@ +93,5 @@
> > > + *     user to be notified that data submission will occur. The function
> > > + *     receives a Promise. It should be resolved when the user has been
> > > + *     notified of data submission or rejected if display of user notification
> > > + *     could not be completed. The value passed to .resolve() and .reject() is
> > > + *     ignored.
> > 
> > While this is quite acceptable, it would probably be more idiomatic if both
> > methods returned a promise instead of accepting one.
> 
> My rationale here was to make an easier-to-consume API. In the current form
> I provide specific functions which must be called. I concede that the
> current implementation of these functions does little more than the
> equivalent of passing a simple value to resolve(). But, this functionality
> will likely expand in the future and I want all the logic to be captured in
> one place, away from the callers so we don't have to worry about
> implementations screwing it up.

No problem. Anyway, the rewrites required by porting to promise/core.js should clarify any doubt that users can have about that bit of the API.
Blocks: 718066
No longer blocks: 718067
Component: General → Metrics and Firefox Health Report
Product: Toolkit → Mozilla Services
Blocks: 808109
No longer blocks: 718066
Blocks: 808219
No longer blocks: 808109
I've reworked the scheduling logic. Instead of basing next scheduled time off of previous times, we now explicitly set the next scheduled time in a pref. I /think/ this model is more sane. I can see it both ways.

I'm requesting f? not r? because I just refactored this and haven't finished porting the tests to the new code. I'm hopeful policy.jsm stays intact. Although, I expect testing will reveal bugs around scheduling.

I also added in some backoff code. After first "daily" failure we try again in 15 minutes. After the second failure, we back off by a little more. On the 3rd failure, we give up for the day and schedule for +24h.

The way scheduling is implemented means that ping times will drift and won't occur at exact 24h intervals. Given usage patterns of Firefox, I anticipate that this will result in clients skipping a day's ping once in a while, even if they use Firefox every calendar day. I don't think there is any way getting around this unless we want to schedule pings more frequently than 24h.

I also changed scheduling of the prompting. It is now scheduled in relation to the first run time, not time before the first upload time. I may change this again, as I think the current implementation might be contrary to what Tom and others outlined.

Anyway, here's some new code to look at.
Attachment #674355 - Attachment is obsolete: true
Attachment #675654 - Attachment is obsolete: true
Attachment #674355 - Flags: feedback?(mreid)
Attachment #674355 - Flags: feedback?(mconnor)
Attachment #674355 - Flags: feedback?(gavin.sharp)
Attachment #674355 - Flags: feedback?(deinspanjer)
Attachment #679200 - Flags: feedback?(rnewman)
Cleaned up code. Tests all pass.

There was lots of feedback on earlier patches regarding scheduling. I /think/ I addressed them all in one way or another. At this point, if there are refinements that aren't critical to launch, I'm inclined to defer them to follow-up bugs.

Please eviscerate this patch with your reviews.
Attachment #679200 - Attachment is obsolete: true
Attachment #679200 - Flags: feedback?(rnewman)
Attachment #679256 - Flags: review?(rnewman)
Attachment #679256 - Flags: feedback?(mreid)
Attachment #679256 - Flags: feedback?(deinspanjer)
>+ * If the user never explicitly accepts or rejects the policy, it will be
>+ * implicitly accepted after a specified duration of time. The notice is
>+ * expected to remain displayed even after explicit acceptance (in case the
>+ * user is away from the device). So, no event signaling implicit acceptance
>+ * is exposed.

Do we really want the notice to remain displayed after *explicit* acceptance?  I think that should say "implicit acceptance".
Comment on attachment 679256 [details] [diff] [review]
Policy and scheduling driver for FHR, v6

Review of attachment 679256 [details] [diff] [review]:
-----------------------------------------------------------------

Code looks fine. \o/

::: services/healthreport/policy.jsm
@@ +29,5 @@
> + *
> + * When one of these requests is received, the first thing a callee should do
> + * is present notification to the user of the data policy. When the notice
> + * is displayed to the user, the callee should call `onUserNotifyComplete`.
> + * This begins a countdown timer that when expires will signal implicit

Rework "that when expires".

@@ +108,5 @@
> + * Instances of this are created when the policy requests data submission.
> + * Receivers are expected to call one of the provided on* functions to signal
> + * completion of the request.
> + *
> + * Instances of this type should not be instantiated outside of this file.

So why are you exporting the symbol?

@@ +276,5 @@
> +  /**
> +   *  How often to poll to see if we need to do something.
> +   *
> +   * The interval needs to be short enough such that short-lived applications
> +   * have an opportxpcshelly to submit data. But, it also needs to be long enough

That is the best word ever. Fiddy Cent and G-xpcshell would be proud.

@@ +328,5 @@
> +   * This is used for scheduling of the initial submission.
> +   */
> +  get firstRunDate() {
> +    return CommonUtils.getDatePref(this._prefs, "firstRunTime", 0, this._log,
> +                                   2012);

Lift 2012 into a file-level constant. We might eventually want to track build times on a coarse level, no?

@@ +376,5 @@
> +  /**
> +   * Records the result of user notification of data submission policy.
> +   *
> +   * This is used for logging and diagnostics purposes. It can answer the
> +   * question "how was data submission agreed to on this profile."

Question mark.

@@ +404,5 @@
> +   * never submit data, even if the user has agreed to it.
> +   */
> +  get dataSubmissionEnabled() {
> +    // Default is true because we are opt-out.
> +    return this._prefs.get("dataSubmissionEnabled", true);

You probably want to add this explicitly to prefs, so that a user can see it exists and turn it off.

@@ +524,5 @@
> +   * Data submission will not be allowed to occur if this is called.
> +   *
> +   * This is typically called through the `onUserReject` property attached to
> +   * the promise passed to `onUserNotify` in the policy listener. But, it can
> +   * be called through other interfaces.

State explicitly whether it's acceptable to call recordUserRejection *after* a user has accepted (implicitly or explicitly). Can we use this interface to implement opt-out in prefs? If so, isn't userNotifiedResponseDate the wrong name?

@@ +602,5 @@
> +    }
> +
> +    // User has opted out of data submission.
> +    if (!this.dataSubmissionAccepted) {
> +      this._log.info("Data submission has been disabled per user request.");

This will be written into the log every POLL_INTERVAL_MSEC. It would be nice if that were not the case: debug log instead, and have enabled/disabled visible in prefs/about:healthreport/etc?

@@ +611,5 @@
> +    // comes the scheduling part.
> +
> +    let nextSubmissionDate = this.nextDataSubmissionDate;
> +
> +    if (now.getTime() < nextSubmissionDate.getTime()) {

Call now.getTime() only once. You call it three times.

@@ +779,5 @@
> +      return false;
> +    }
> +
> +    let offset = this.FAILURE_BACKOFF_INTERVALS[this.currentDaySubmissionFailureCount];
> +    this.nextDataSubmissionDate = new Date(this.now().getTime() + offset);

This pattern appears a few times. How about introducing futureDate(msecOffset)?
Attachment #679256 - Flags: review?(rnewman) → review+
(In reply to Richard Newman [:rnewman] from comment #25)
> @@ +108,5 @@
> > + * Instances of this are created when the policy requests data submission.
> > + * Receivers are expected to call one of the provided on* functions to signal
> > + * completion of the request.
> > + *
> > + * Instances of this type should not be instantiated outside of this file.
> 
> So why are you exporting the symbol?

Unit testing. assert isinstance, etc. Although, you can make the case that API guarantees plus interface conformance should be good enough.

> @@ +404,5 @@
> > +   * never submit data, even if the user has agreed to it.
> > +   */
> > +  get dataSubmissionEnabled() {
> > +    // Default is true because we are opt-out.
> > +    return this._prefs.get("dataSubmissionEnabled", true);
> 
> You probably want to add this explicitly to prefs, so that a user can see it
> exists and turn it off.

This will be done in the healthreport code. Default preference files typically live near services/implementations.
Attached file Policy (obsolete) —
First, ignore last patch. Ugh, Bugzilla.

Implemented review feedback. I also added a few more forensic prefs to track last successful, failed, and attempted data submission times. It can't hurt.

At this point I figure feedback from Metrics people can be done as follow-up parts or bugs.
Attachment #679256 - Attachment is obsolete: true
Attachment #679422 - Attachment is obsolete: true
Attachment #679256 - Flags: feedback?(mreid)
Attachment #679256 - Flags: feedback?(deinspanjer)
Attachment #679423 - Flags: review?(rnewman)
Comment on attachment 679423 [details] [diff] [review]
Policy and scheduling driver for FHR, v7

Review of attachment 679423 [details] [diff] [review]:
-----------------------------------------------------------------

This looks fine by me. Onward.

Assuming that mconnor is busy today, I'm happy to get retroactive review from him prior to merge.

::: services/healthreport/policy.jsm
@@ +15,5 @@
> +Cu.import("resource://gre/modules/services-common/utils.js");
> +
> +const MILLISECONDS_PER_DAY = 24 * 60 * 60 * 1000;
> +
> +const OLDEST_ALLOWED_YEAR = 2012;

Nit: comment this for the poor sucker reading this file in 2014. E.g., 

  // Used as a sanity lower bound for dates extracted from prefs.
  // This module was implemented in 2012, so any earlier dates indicate
  // an incorrect clock.
Attachment #679423 - Flags: superreview?(mconnor)
Attachment #679423 - Flags: review?(rnewman)
Attachment #679423 - Flags: review+
Naming. Using "dataSubmissionPolicy" everywhere instead of generic "userNotified."
Attachment #679423 - Attachment is obsolete: true
Attachment #679423 - Flags: superreview?(mconnor)
Attachment #679429 - Flags: review?(rnewman)
Comment on attachment 679429 [details] [diff] [review]
Policy and scheduling driver for FHR, v8

Review of attachment 679429 [details] [diff] [review]:
-----------------------------------------------------------------

::: services/healthreport/policy.jsm
@@ +339,5 @@
> +
> +  /**
> +   * When the user was notified that data submission could occur.
> +   *
> +   * This is used for logging purposes. this._userNotifiedDate is what's used

Wanna change the internal name, too, then?
Attachment #679429 - Flags: review?(rnewman) → review+
> +    if (nowT < nextSubmissionDate.getTime()) {
> +      this._log.debug("Next data submission is scheduled in the future: " +
> +                     nextSubmissionDate);
> +      return;
> +    }

If nextSubmissionDate is more than (2 * MILLISECONDS_PER_DAY) in the future,
we should consider that the client's clock is (or was) broken, and reset
nextSubmissionDate to either now or tomorrow.  This would handle the case
where the client's clock was set to the year 2020 long enough to
submit, then corrected.  If we don't detect that situation, that client
would not submit again until 2020.

Alternately we could detect the case where "lastDataSubmissionSuccessfulDate"
is in the future.
Attachment #679429 - Flags: feedback+
Attachment #679429 - Flags: superreview?(mconnor)
Blocks: 809954
Comment on attachment 679429 [details] [diff] [review]
Policy and scheduling driver for FHR, v8

Bulk-setting approval flags for FHR landing for FxOS ADU ping (Bug 788894).
Attachment #679429 - Flags: superreview?(mconnor)
Attachment #679429 - Flags: approval-mozilla-beta?
Attachment #679429 - Flags: approval-mozilla-aurora?
Target Milestone: --- → mozilla20
https://hg.mozilla.org/mozilla-central/rev/bea765d36cad
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Comment on attachment 679429 [details] [diff] [review]
Policy and scheduling driver for FHR, v8

FHR for B2G ADU ping, won't be build/enabled for Mobile/Desktop.
Attachment #679429 - Flags: approval-mozilla-beta?
Attachment #679429 - Flags: approval-mozilla-beta+
Attachment #679429 - Flags: approval-mozilla-aurora?
Attachment #679429 - Flags: approval-mozilla-aurora+
Depends on: 815320
Component: Metrics and Firefox Health Report → Client: Desktop
Flags: in-testsuite+
Product: Mozilla Services → Firefox Health Report
Target Milestone: mozilla20 → Firefox 20
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.