Closed Bug 974009 Opened 6 years ago Closed 6 years ago

Telemetry experiments: test experiment conditions and enable experiments

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED
Firefox 31

People

(Reporter: benjamin, Assigned: gfritzsche)

References

Details

Attachments

(6 files, 27 obsolete files)

1.19 KB, text/plain
Details
5.70 KB, patch
Felipe
: review+
Details | Diff | Splinter Review
6.04 KB, patch
Felipe
: review+
Details | Diff | Splinter Review
3.78 KB, patch
Felipe
: review+
Details | Diff | Splinter Review
24.24 KB, patch
Felipe
: review+
Details | Diff | Splinter Review
52.06 KB, patch
Unfocused
: review+
Details | Diff | Splinter Review
The meaty part: for each experiment, test the experiment conditions and run the experiment if it meets the conditions. This involves both the static conditions expressed in the manifest as well as the dynamic JS function run in a sandbox.

Need some detail about randomization/sampling functions and pref controls.
Real world sampling schemes we have wanted or used in previous experiments (with past implementation):

- x% of eligible users.  Impl: pref:  tp.randomDeploy.somekey=100*random_float()
  * we actually don't like this, but it was cheap.
- N eligible users (N = 2000).  Enroll until you have enough.  (This implies a 'do you have room in experiment x for a user like this one' service / server).
- People (not) in arm/condition X of previous study.  (Specifically: people who were/were-not in the previous study).   in `runOrNot()`... `tp.randomDeplay.sometkey not in interval` 
- condition:  windows Or osx, has prefs x,y,z;  has/not addon x,y,  version > x, etc.  Exact matches "%eq%" might not be quite enough.  %in%, %gt%, %lt%, %like% for all fields is probably enough.  (my mental model here is Mongo Query Lang... http://docs.mongodb.org/manual/reference/method/db.collection.find/#db.collection.find ).
What's specified in the PRD is what we're going to do for v1.  So we're not going to provide N=2000 functionality (and I'm worried about its sampling characteristics anyway).

Most of the other things in comment 1 can be accomplished in this sytem except for *has pref*. Normal pref values are not exposed via telemetry or FHR and are therefore not available as conditions.
Why bother creating static conditions? It seems needlessly complicated to invent an expression language. Let's just execute sandboxed JS everywhere. For the simple conditions, it will be a one-line JS function.

For the eventual "N eligible users" case, we can have the server emit JS to XHR a "gatekeeper" service until the quota is reached, at which point that test disappears from the server response.
tl;dr - I can live with this in v.0, but it has risks that have bitten experiment writers in the past.

+ more expressive

- Those "one lines" are easy to break and hard to verify
- having easy "arm branching" is important here.  The result of this is is not just 'run or not' but 'run which thing'.

If we are having a 'runOrNot' as executable in v0, then I will live with it.

(In my fantasy world, I really want all experiment manifests to be purely declarative "reversible list of steps" with addons but I think I am being overruled).  

We have had real bugs around this is in the past.  

Revised ask:  A really simply exposed 'theUser' object that exposes user info and the prefs.
I want to balance things here: static conditions for simple things like channel/version/buildid are good and we should implement them now.

No requirement for arm branching was expressed in v1 and so we have no plans to implement it now.

As for "theUser" the current spec is for telemetry and FHR objects, which already have fairly well defined object models. I don't want to define a new object model unless it's necessary. And I don't want to expose arbitrary pref values at all.

Let's stick to the script!
Assignee: nobody → georg.fritzsche
Unfocused, getInstallForURL() [1] looks like the right interface for triggering addon installation from a URL.
This also looks like it's suitable for silent addon installation?
The mimetype argument for this is non-optional, is there a reason for that and can we avoid it?

[1] https://developer.mozilla.org/en-US/Add-ons/Add-on_Manager/AddonManager#getInstallForURL%28%29
Flags: needinfo?(bmcbride)
bsmedberg, i think you already had something in mind for setting up a sandboxing environment for the JS filter function.
Do you have any pointers for that?
Flags: needinfo?(benjamin)
(In reply to Georg Fritzsche [:gfritzsche] from comment #6)
> Unfocused, getInstallForURL() [1] looks like the right interface for
> triggering addon installation from a URL.
> This also looks like it's suitable for silent addon installation?

Correct!

> The mimetype argument for this is non-optional, is there a reason for that
> and can we avoid it?

You cannot avoid it - Add-ons Manager needs to know what type of add-on it is. But, for your purposes it'll always be "application/x-xpinstall".


FWIW, the following code for how we install hotfixes should be a good rough guide:
https://hg.mozilla.org/mozilla-central/file/32a41942974c/toolkit/mozapps/extensions/AddonManager.jsm#l1109

Which also covers checking signed add-ons against pinned certificates too.
Flags: needinfo?(bmcbride)
Ben, in V1:

1.  Are addons responsible for uninstall themselves and their own cleanup (I am okay with this!)
2.  What's the guard mechanism for "double installing".  If a user uninstalls for example, we don't want to 'redo' Some possibilities:
  - convention:  that once an experiment addon installs, it sets a pref.
  - installer (TelEx) sets a pref of addons it has installed.  

As usual, the more that experiments are required to do, the more risk there is that builders (i.e., *me* will do it wrong).  

My own hunch leans toward aggressively uninstalling / rolling back, and a 'belt and suspenders' sort of approach.
Flags: needinfo?(benjamin)
(In reply to Gregg Lind (User Research - Test Pilot) from comment #10)
> 1.  Are addons responsible for uninstall themselves and their own cleanup (I
> am okay with this!)

I think at least uninstall has to be triggered from the experiments code anyway for when we need to stop and disable experiments (e.g. opt-out).
Clean-up sounds like it should be the addons job as we can't track everything it does?

> 2.  What's the guard mechanism for "double installing".  If a user
> uninstalls for example, we don't want to 'redo' Some possibilities:
>   - convention:  that once an experiment addon installs, it sets a pref.
>   - installer (TelEx) sets a pref of addons it has installed.  

I'm not sure if i understand the problem right, but we will keep track of experiments for a while (in bug 973997, 180 days were mentioned). We would avoid re-installation based on that.
The experiment system will be responsible for uninstalling the addon.

The addon itself will be responsible for cleaning up any of its custom data but should be able to do this from the uninstall hook in bootstrap.js. NEEDINFO on Unfocused to double-check my assertion there!
Flags: needinfo?(benjamin) → needinfo?(bmcbride)
I think you are both understanding it how I do.  The times when TelEx would need to do something include:

1.  The 'maxExperimentTime' is exceeded -->  addonManager.uninstall(id)
2.  If there is a 'kill' override somewhere --> addonManager.uninstall(id)
3.  Someone opts out of telemetry?  --> addonManager.uninstall(id)
Yes. Also if the conditions stop matching.
Pointed out during standup today: when a new version of an existing experiment is installed, the bootstrap.js function install reason should be ADDON_UPGRADE. This probably works automatically, but we should test and if there's an issue please file a followup.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #12)
> The experiment system will be responsible for uninstalling the addon.
> 
> The addon itself will be responsible for cleaning up any of its custom data
> but should be able to do this from the uninstall hook in bootstrap.js.
> NEEDINFO on Unfocused to double-check my assertion there!

Correct. This will even work if the add-on is disabled. Note, however, that this only support synchronous code at the moment. If you need async cleanup you'll need another solution (bug 771441 would add support for async, but I don't have the cycles to implement it ATM - if you need async, doing your own solution would be quicker).

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #15)
> Pointed out during standup today: when a new version of an existing
> experiment is installed, the bootstrap.js function install reason should be
> ADDON_UPGRADE. 

Correct.
Flags: needinfo?(bmcbride)
Blocks: 981842
Attached file API sketch (obsolete) —
Sketch for the API we should need (which will later be available in proper docs).
Georg, maybe too late, but how about a "more info" key on the experiment?
Also, for 981842, how do I hit this thing programmatically?  Do I get it out of the FHR json packet?
Attached patch Ugly WIP patch for bsmedberg (obsolete) — Splinter Review
(In reply to Gregg Lind (User Research - Test Pilot) from comment #18)
> Georg, maybe too late, but how about a "more info" key on the experiment?

Hm, what would that be for?
(In reply to Gregg Lind (User Research - Test Pilot) from comment #19)
> Also, for 981842, how do I hit this thing programmatically?  Do I get it out
> of the FHR json packet?

Ah, you can access the object that implements the API from the service:
  let experiments = Cc["@mozilla.org/browser/experiments-service;1"]
    .getService(Ci.nsISupports).wrappedJSObject.experiments;
Isn't the API we want the Experiments.getExperiments() API from the API sketch? So it would be something like:

Cu.import("resource:///modules/Experiments.jsm");
promise = Experiments.getExperiments();
...
1.  "more info" is a link url for the explanatory page about the experiment with any and all extra info about the experiment.  If the description is url aware, that is fine too, but why not make it explicit?

2.  Thanks for the code link!
(In reply to Gregg Lind (User Research - Test Pilot) from comment #24)
> 1.  "more info" is a link url for the explanatory page about the experiment
> with any and all extra info about the experiment.  If the description is url
> aware, that is fine too, but why not make it explicit?

Hm, i don't see this in bug 973992, do we need this in v1?
I think I completely missed it in spec'ing :)  Ben, ruling on this?

**Proposal:   add "information [required]" key to experiments**

pros:
- keeps experimenters honest, in that they have to write a full page describing things (data collected, goals, etc.)
- explicit

cons:
- another thing to do.

**Alternative:  cram it into description and promote `urls` into links at display.**

pro - built
con - promoting links at display time is gross.



This would show up in `about:addons` and `about:support`

I think I missed this completely in thinking about the spec.
Can we just slurp this out of the install.rdf <em:homepageURL> and save it the same way we're saving the title (& description?).
That sounds good, we'll only need it for experiments that were or are active anyway.
Attached file API sketch (obsolete) —
I talked with bsmedberg and we will have something like comment 23 as the way to access Experiments.
Updated the API sketch to include an example for that.
Attachment #8389103 - Attachment is obsolete: true
Attached patch Another ugly WIP (obsolete) — Splinter Review
This implements/fakes the API above, so it should be good to test against for about:support and the addon manager.
Attachment #8389780 - Attachment is obsolete: true
Attached file API sketch
Minor change per Greggs suggestion: endDate is an integer (epoch-ms) now.
Attachment #8389867 - Attachment is obsolete: true
Attached patch WIP (obsolete) — Splinter Review
This is looking more complete now - it is mostly missing scheduling timers (for re-evaluation the experiments etc.) and extended tests. Probably a little broken because it's late now.

Would love to get feedback though so i can incorporate it timely.
Attachment #8389870 - Attachment is obsolete: true
Attachment #8390131 - Flags: feedback?(bmcbride)
Attachment #8390131 - Flags: feedback?(gps)
Comment on attachment 8390131 [details] [diff] [review]
WIP

Hey felipe, Unfocused might get to this tomorrow, do you maybe have some feedback or criticism at this point that i could work in today?
Attachment #8390131 - Flags: feedback?(felipc)
Attached patch WIP (obsolete) — Splinter Review
Attachment #8390131 - Attachment is obsolete: true
Attachment #8390131 - Flags: feedback?(gps)
Attachment #8390131 - Flags: feedback?(felipc)
Attachment #8390131 - Flags: feedback?(bmcbride)
Attachment #8390477 - Flags: feedback?(felipc)
Attachment #8390477 - Flags: feedback?(bmcbride)
Attached patch Tests for condition evaluation (obsolete) — Splinter Review
@Gregg, this test could use extension and should be relatively straight-forward.
We should cover all the specified conditions from bug 973997 in this.
Attached patch WIP (obsolete) — Splinter Review
Attachment #8390477 - Attachment is obsolete: true
Attachment #8390477 - Flags: feedback?(felipc)
Attachment #8390477 - Flags: feedback?(bmcbride)
Attachment #8390534 - Flags: feedback?(felipc)
Attachment #8390534 - Flags: feedback?(bmcbride)
Attached patch Tests for condition evaluation (obsolete) — Splinter Review
Attachment #8390478 - Attachment is obsolete: true
Update the above as i needed to make more things async for the jsfilter context.
Note that i started to fold both the patches for bug 973999 and this bug together as it got hard to keep required changes apart.
Comment on attachment 8390477 [details] [diff] [review]
WIP

>diff --git a/browser/experiments/Experiments.jsm b/browser/experiments/Experiments.jsm

>+this.EXPORTED_SYMBOLS = [
>+  "Experiments",
>+  "ExperimentEntry",

We don't need to export ExperimentEntry.

>+const PREF_BRANCH              = "experiments.";

>+XPCOMUtils.defineLazyGetter(this, "gLogAppenderDump", function() {
>+  return new Log.DumpAppender(new Log.BasicFormatter());
>+});

Can we do this work in configureLogging() instead of making a lazy getter here?

>+// Takes an array of promises and returns a promise that is resolved once all of
>+// them are rejected or resolved.

Does this mean that any rejections will be silently discarded? What data is it resolved with? It appears to be resolved with the original array of promises, although this code doesn't appear to use the result value at all so maybe null/undefined would be better?

>+// Loads a JSON file using OS.file. file is a string representing the path
>+// of the file to be read, options contains additional options to pass to
>+// OS.File.read.
>+// Returns a Promise resolved with the json payload or rejected with
>+// OS.File.Error

Or rejected with JSON parsing errors, no?

>+  // Pending manifest and cache operation tasks, they should not interfere
>+  // with each other.

I'm not sure what this comment means.

>+  init: function () {
>+    configureLogging();
>+
>+    Services.obs.addObserver(this, "xpcom-shutdown", false);

This is a strong observer, but the uninit() method doesn't remove it. This will cause the object to live as long as the process, which doesn't seem ideal.

>+    this._experiments = new Map();

Document that this is a map of string->ExperimentEntry and that it contains entries from both the current user's experiment history as well as current manifest items.

>+  observe: function (subject, topic, data) {
>+    switch(topic) {

nit, space after switch (control statement)

>+    case "profile-after-change":
>+      this.init();
>+      gLogger.trace("Experiments::observe() - profile-after-change");
>+      break;

Is it ever the case that an Experiments() object will be initialized before the profile is active? That surprises me, and I suspect we should just avoid that case completely.

>+    case "xpcom-shutdown":

This should happen before xpcom-shutdown. profile-before-change seems like the right location. However, it seems that there could be a pending saveToCache promise which really must be resolved before we shut down the profile. It seems like you really ought to be using AsyncShutdown.jsm for this instead of the raw observer topic (since uninit returns a promise).

>+  /**
>+   * Toggle whether the experiments feature is enabled or not.
>+   */
>+  toggleExperimentsEnabled: function (enabled) {
>+    gLogger.trace("Experiments::toggleExperimentsEnabled(" + enabled + ")");
>+    gPrefs.set(PREF_ENABLED, enabled);
>+  },

What is this for, and how does it interact with the global pref observer which does approximately the same thing? Is this intended only for use within tests?

>+
>+  _toggleExperimentsEnabled: function (enabled) {
>+    gLogger.trace("Experiments::_toggleExperimentsEnabled(" + enabled + ")");
>+    gExperimentsEnabled = enabled;
>+    if (!enabled) {
>+      this._disableExperiments();
>+    } else {
>+      this.updateManifest();
>+    }
>+
>+    this._notifyObservers();

Why are we firing the topic immediately in both cases? I would expect to fire the topic immediately if we are disabling experiments, but if we are enabling experiments I wouldn't expect us to fire the topic until we succeed in fetching the manifest.

>+  /**
>+   * Returns a promise that is resolved with an array of `ExperimentInfo` objects,
>+   * which provide info on the currently and recently active experiments.
>+   *
>+   * The experiment info is of the form:
>+   * {
>+   *   id: <string>,
>+   *   name: <string>,
>+   *   description: <string>,
>+   *   active: <boolean>,
>+   *   endDate: <integer>, // epoch ms
>+   *   ... // possibly extended later
>+   * }

File a followup for the details link if we're not going to do it now?

The list should be in chronological order so that consumers don't have to sort it. You can Array.sort the list after you've collected it.

>+  getExperiments: function () {
>+    let list = [];
>+
>+    for (let [id, experiment] of this._experiments) {

Don't we need to make sure that we've fetched this._experiments before collecting this list? So use Task to call/make sure _loadFromCache is complete before continuing here.

>+  /*
>+   * Fetch an updated list of experiments and trigger experiment updates.
>+   *
>+   * @return Promise<>
>+   *         The promise is resolved when the manifest and experiment list is updated.

>+  updateManifest: function () {
>+    gLogger.trace("Experiments::updateManifest()");
>+
>+    if (!gExperimentsEnabled) {
>+      return Promise.reject(new Error("experiments are disabled"));
>+    }

Why is this a reject() instead of resolve()? Are we not supposed to call this method if experiments are disabled?

>+    let uri = Services.urlFormatter.formatURLPref(PREF_BRANCH + PREF_MANIFEST_URI);
>+    if (uri == "about:blank")

I'm not sure this error-checking is necessary: urlformatter already logs an error in this case and the code below should work correctly with about:blank.

>+    this._pendingTasks.updateManifest = Task.spawn(function () {
>+      // Don't interfere while we're saving or loading the cache.

Why? Aren't they independent operations? I can understand blocking loadFromCache/saveToCache on those pending operations, but updateManifest seems independent.

>+  /*
>+   * Helper function to make HTTP GET requests. Returns a promise that is resolved with
>+   * the responseText when the request is complete.
>+   */
>+  _httpGetRequest: function (url) {

It seems like this could be a standalone helper function.

As noted on IRC, _updateExperiments can't blindly remove experiments from _experiments which are no longer in the manifest. Instead of create a new _manifests, I think it should update the existing map and remove entries only if 1) they are no longer in the manifest and 2) we never ran them or 3) they finished more than 180 days ago.

The frozen behavior doesn't seem to match the spec (it seems to be the same as disabled, currently). frozen should not disable a currently-running experiment, but disabled should.

>+  /*
>+   * Returns a promise that is resolved when all _pendingTasks are resolved
>+   * or rejected.
>+   * exceptThese is optional and allows to specify exceptions of tasks not to
>+   * wait on.
>+   */
>+  _pendingTasksDone: function (exceptThese) {

Having an exception list feels backwards, and since there are only three pending exceptions ever, please just make the callers be explicit and pass the tasks they actually are waiting on:

  allResolvedOrRejected([this._updateManifestTask, this._loadFromCacheTask])

This will simplify the code, make it more explicit and easier to read at the call sites.

>+  /*
>+   * Cache the experiments manifest file on disk. Returns a promise that is
>+   * resolved when the manifest was written to disk.
>+   */

Are we saving the manifest data or the experiment history to disk here, or both?

>+  /*
>+   * Stop running experiments and disable further activations.
>+   */
>+  _disableExperiments: function () {
>+    gLogger.trace("Experiments::disableExperiments()");
>+  },

Missing code? Doesn't this require actually .stop()ping the currently-active ExperimentEntry?

>+
>+  /**
>+   * Disable an experiment by id.
>+   * @param experimentId The id of the experiment.
>+   * @return Promise<> Promise that will get resolved once the task is done or failed.
>+   */
>+  disableExperiment: function (experimentId) {
>+    gLogger.trace("Experiments::disableExperiment() - " + experimentId);
>+
>+    let activeExperiment = this._getActiveExperiment();

Wouldn't it be better to do this as this._experiments[id].stop() ?

>+  /*
>+   * Check applicability of experiments, disable the active one if needed and
>+   * activate the first applicable candidate.
>+   * @return Promise<boolean> Resolved when done, the value indicates whether
>+   *                          we activated an experiment.

What does it return if there was an experiment active and it remained active? Or if there was an experiment active and we disabled it?

>+  /*
>+   * Schedule the soonest re-check of experiment applicability that is needed.
>+   */
>+  _scheduleExperimentEvaluation: function () {
>+    let date = 0;
>+    let now = this._policy.now().getTime();
>+
>+    for (let [id, experiment] of this._experiments) {
>+      let scheduleDate = experiment.getScheduleDate();
>+      date = Math.min(date, scheduleDate);
>+    }
>+
>+    // TODO: schedule timer for |date|

Missing code, or followup? Should be pretty simple with nsITimer.

>+  _notifyObservers: function () {
>+    Services.obs.notifyObservers(null, OBSERVER_TOPIC, null);
>+  },

This method seems silly, can you just move it to the callers?

>+Experiments.ExperimentEntry = function (policy) {

As noted earlier, I don't *think* you need to export this and so it can just be "function ExperimentEntry()" here.

>+  this._enabled = false;
>+  this._startDate = null;
>+  this._manifestData = null;
>+  this._needsUpdate = false;
>+  this._randomValue = null;
>+  this._lastChangedDate = null;

Comment these. _enabled is the current enabled status of the experiment, right?
_startDate is the date the experiment was started for this user, etc.
I'm really not sure what _needsUpdate or _lastChangedDate represent (maybe will be clear reading later code, but it should be clear at declaration time).

>+  MANIFEST_OPTIONAL_FIELDS: new Set([
>+    "maxStartTime",
>+    "appName",

should be required

>+    "channel",

This should be required as well, thinking about possible deployment mistakes!

>+  get endDate() {
...
>+    return new Date().setTime(endTime);

This doesn't seem to be what you want. This returns the endTime integer and not the date object.

>+  /*
>+   * Initialize entry from the cache.
>+   * @param data The entry data from the cache.
>+   * @return boolean Whether initialization succeeded.
>+   */
>+  initFromCacheData: function (data) {

>+
>+    this._randomValue = Math.random();

Don't we have to cache the random value?

>+  /*
>+   * Has the experiment data from the manifest changed?
>+   * @param data The experiment data from the manifest.
>+   * @return boolean Whether the experiment data from the manifest has changed.
>+   */
>+  hasChanged: function (newData) {

This method doesn't appear to be used anywhere.

>+  /*
>+   * Update from the experiment data from the manifest.
>+   * @param data The experiment data from the manifest.
>+   * @return boolean Whether updating succeeded.

Since this method never fails, the return value seems unnecessary.

>+   */
>+  updateFromFromManifestData: function (data) {

"FromFrom"? Also don't we need to validate the data?

>+  /*
>+   * Run the jsfilter function from the manifest in a sandbox and return the
>+   * result (forced to boolean).
>+   */
>+  _runFilterFunction: function (jsfilter) {
>+    gLogger.trace("ExperimentEntry::runFilterFunction() - filter: " + jsfilter);
>+
>+    const systemPrincipal = Cc["@mozilla.org/nullprincipal;1"].createInstance(Ci.nsIPrincipal);

Perhaps you mean "const nullPrincipal"? The system principal is all-powerful and not what you're actually getting.

>+    let context = {
>+      healthReportEnabled: false,
>+      healthReportSubmissionEnabled: false,
>+      telemetryEnabled: false,
>+      healthReportPayload: null,
>+      telemetryPayload: null,
>+    };

Are these TODO items?

>+
>+    Cu.evalInSandbox(jsfilter, sandbox);
>+    sandbox.context = context;
>+    let result = Cu.evalInSandbox("!!filter(context)", sandbox);
>+    Cu.nukeSandbox(sandbox);
>+
>+    return result === true;

The evalInSandbox needs a try/catch. And I think we should use truthy values not ===, so

let result = !!Cu.evalInSandbox(...)

>diff --git a/browser/experiments/Experiments.manifest b/browser/experiments/Experiments.manifest

>\ No newline at end of file

nit, fix

>diff --git a/browser/experiments/ExperimentsService.js b/browser/experiments/ExperimentsService.js

>+  notify: function (aTimer) {
>+    dump("ExperimentsService::notify()");

nit, remove all dumps

>+  observe: function (aSubject, aTopic, aData) {
>+    switch(aTopic) {

nit, space

>+    case "xpcom-shutdown":
>+      dump("ExperimentsService::observe() - xpcom-shutdown");
>+      Services.obs.removeObserver(this, "xpcom-shutdown");
>+      this._experiments.uninit();
>+      break;

Because Experiments objects register their own shutdown observers, none of this xpcom-shutdown notification is necessary.

Finally, none of this takes into account the toolkit.telemetry.enabled (or toolkit.telemetry.enabledPreRelease) prefs: this whole system must gate on telemetry being enabled.
Attachment #8390477 - Attachment is obsolete: false
Attachment #8390477 - Attachment is obsolete: true
Depends on: 983226
Depends on: 983231
Comment on attachment 8390534 [details] [diff] [review]
WIP

Review of attachment 8390534 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/experiments/Experiments.jsm
@@ +943,5 @@
> +
> +      install.addListener({
> +        onDownloadEnded: install => {
> +          gLogger.trace("ExperimentEntry::start() - download ended for " + this._id);
> +          let addon = install.addon;

We /may/ want to explicitly set |addon.type = "experiment"| here.

Actually, if we do that, it should be done in the onInstallStarted callback, since I believe the XPI/install may be returned from a cache if it has been previously fetched.

You'll want to ask Unfocused if modifying addon.type post instantiation is acceptable.
(In reply to Gregory Szorc [:gps] from comment #41)
> Comment on attachment 8390534 [details] [diff] [review]
> WIP
> 
> Review of attachment 8390534 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: browser/experiments/Experiments.jsm
> @@ +943,5 @@
> > +
> > +      install.addListener({
> > +        onDownloadEnded: install => {
> > +          gLogger.trace("ExperimentEntry::start() - download ended for " + this._id);
> > +          let addon = install.addon;
> 
> We /may/ want to explicitly set |addon.type = "experiment"| here.
> 
> Actually, if we do that, it should be done in the onInstallStarted callback,
> since I believe the XPI/install may be returned from a cache if it has been
> previously fetched.
> 
> You'll want to ask Unfocused if modifying addon.type post instantiation is
> acceptable.

Looking at XPIProvider.jsm a bit more, I don't think this is safe. There is code in the low-level loadManifestFromRDF() that keys off addon.type for setting other add-on options, such as whether bootstrap (restartless add-on) is enabled. There is no extension point here and I think mucking with addon.type after other things have been set depending on it is very fragile.

It's best for us to set the type in the XPI properly.

I can make a case for Experiments.jsm to cancel installation if the downloaded add-on isn't the proper type. Multi-layered checking FTW.
TL;DR: ni? mainly for the task design.

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #40)
> >+// Takes an array of promises and returns a promise that is resolved once all of
> >+// them are rejected or resolved.
> 
> Does this mean that any rejections will be silently discarded? What data is
> it resolved with? 

Yes, silently discarded from the caller side. It's harder to figure out how to properly propagate failures if you want all to get resolved and i don't need it for the use-case here (synchronizing/serializing tasks).

> It appears to be resolved with the original array of
> promises, although this code doesn't appear to use the result value at all
> so maybe null/undefined would be better?

Indeed, done.

> >+  // Pending manifest and cache operation tasks, they should not interfere
> >+  // with each other.
> 
> I'm not sure what this comment means.

This holds tasks that hit the experiments list and files. We need to serialize them, so only one is running at a time. Clarified.

> >+    Services.obs.addObserver(this, "xpcom-shutdown", false);
[...]
> >+    case "profile-after-change":

Will remove, ExperimentsService takes care of these.

> >+    case "xpcom-shutdown":
> 
> This should happen before xpcom-shutdown. profile-before-change seems like
> the right location. However, it seems that there could be a pending
> saveToCache promise which really must be resolved before we shut down the
> profile. It seems like you really ought to be using AsyncShutdown.jsm for
> this instead of the raw observer topic (since uninit returns a promise).

Ok, let me look into this.

> 
> >+  /**
> >+   * Toggle whether the experiments feature is enabled or not.
> >+   */
> >+  toggleExperimentsEnabled: function (enabled) {
> >+    gLogger.trace("Experiments::toggleExperimentsEnabled(" + enabled + ")");
> >+    gPrefs.set(PREF_ENABLED, enabled);
> >+  },
> 
> What is this for, and how does it interact with the global pref observer
> which does approximately the same thing? Is this intended only for use
> within tests?

One is the public API, while the observer does the actual work (i.e. stopping experiments, ...).

> 
> >+
> >+  _toggleExperimentsEnabled: function (enabled) {
> >+    gLogger.trace("Experiments::_toggleExperimentsEnabled(" + enabled + ")");
> >+    gExperimentsEnabled = enabled;
> >+    if (!enabled) {
> >+      this._disableExperiments();
> >+    } else {
> >+      this.updateManifest();
> >+    }
> >+
> >+    this._notifyObservers();
> 
> Why are we firing the topic immediately in both cases? I would expect to
> fire the topic immediately if we are disabling experiments, but if we are
> enabling experiments I wouldn't expect us to fire the topic until we succeed
> in fetching the manifest.

Right, sounds reasonable - i've changed this to only fire on disabling or when the activity of an experiment changed (started or stopped).

> >+   * The experiment info is of the form:
[...]
> File a followup for the details link if we're not going to do it now?

I just missed adding it to the objects there, fixing.

> >+  /*
> >+   * Fetch an updated list of experiments and trigger experiment updates.
> >+   *
> >+   * @return Promise<>
> >+   *         The promise is resolved when the manifest and experiment list is updated.
> 
> >+  updateManifest: function () {
> >+    gLogger.trace("Experiments::updateManifest()");
> >+
> >+    if (!gExperimentsEnabled) {
> >+      return Promise.reject(new Error("experiments are disabled"));
> >+    }
> 
> Why is this a reject() instead of resolve()? Are we not supposed to call
> this method if experiments are disabled?

Hm, i wouldn't expect it getting called. Alternatively we can resolve(undefined) and document that?

> 
> >+    this._pendingTasks.updateManifest = Task.spawn(function () {
> >+      // Don't interfere while we're saving or loading the cache.
> 
> Why? Aren't they independent operations? I can understand blocking
> loadFromCache/saveToCache on those pending operations, but updateManifest
> seems independent.

They all hit the experiments list in the end etc., so we should have them all serialized.

> As noted on IRC, _updateExperiments can't blindly remove experiments from
> _experiments which are no longer in the manifest. Instead of create a new
> _manifests, I think it should update the existing map and remove entries
> only if 1) they are no longer in the manifest and 2) we never ran them or 3)
> they finished more than 180 days ago.

Fixed. Still a new Map though, which means it honors manifest (i.e. insertion) order.

> >+  /*
> >+   * Returns a promise that is resolved when all _pendingTasks are resolved
> >+   * or rejected.
> >+   * exceptThese is optional and allows to specify exceptions of tasks not to
> >+   * wait on.
> >+   */
> >+  _pendingTasksDone: function (exceptThese) {
> 
> Having an exception list feels backwards, and since there are only three
> pending exceptions ever, please just make the callers be explicit and pass
> the tasks they actually are waiting on:
> 
>   allResolvedOrRejected([this._updateManifestTask, this._loadFromCacheTask])
> 
> This will simplify the code, make it more explicit and easier to read at the
> call sites.

Hm, i could change this, but i'm wary of introducing traps later on. This way i just add
other tasks to the task group and i'm done. With explicit listings i need to scan everything
to see where updates might be needed (and possibly introduce errors).

> >+    // TODO: schedule timer for |date|
> 
> Missing code, or followup? Should be pretty simple with nsITimer.

Coming up.

> >+  this._enabled = false;
> >+  this._startDate = null;
> >+  this._manifestData = null;
> >+  this._needsUpdate = false;
> >+  this._randomValue = null;
> >+  this._lastChangedDate = null;
> 
> Comment these. _enabled is the current enabled status of the experiment,
> right?
> _startDate is the date the experiment was started for this user, etc.
> I'm really not sure what _needsUpdate or _lastChangedDate represent (maybe
> will be clear reading later code, but it should be clear at declaration
> time).

Documented.

> >+  MANIFEST_OPTIONAL_FIELDS: new Set([
> >+    "maxStartTime",
> >+    "appName",
> 
> should be required
> 
> >+    "channel",
> 
> This should be required as well, thinking about possible deployment mistakes!

Ah, i missed the comment by just going from the spec patch.

> >+  /*
> >+   * Initialize entry from the cache.
> >+   * @param data The entry data from the cache.
> >+   * @return boolean Whether initialization succeeded.
> >+   */
> >+  initFromCacheData: function (data) {
> 
> >+
> >+    this._randomValue = Math.random();
> 
> Don't we have to cache the random value?

It is cached - the whole entry is getting saved/loaded from/to disk.

> 
> >+    let context = {
> >+      healthReportEnabled: false,
> >+      healthReportSubmissionEnabled: false,
> >+      telemetryEnabled: false,
> >+      healthReportPayload: null,
> >+      telemetryPayload: null,
> >+    };
> 
> Are these TODO items?

Healthreporter was already added in a newer patch, telemetry support is filed as bug 983226.
Flags: needinfo?(benjamin)
(In reply to Gregory Szorc [:gps] from comment #42)
> I can make a case for Experiments.jsm to cancel installation if the
> downloaded add-on isn't the proper type. Multi-layered checking FTW.

Check done.
> > Why is this a reject() instead of resolve()? Are we not supposed to call
> > this method if experiments are disabled?
> 
> Hm, i wouldn't expect it getting called. Alternatively we can
> resolve(undefined) and document that?

No that's ok, we should just be clear about it in the comment, or even throw the exception up front rather than just rejecting the promise, since it's really a sign of a programming error.

> Fixed. Still a new Map though, which means it honors manifest (i.e.
> insertion) order.

Oh! Please document that the insertion order of the map is important: I totally missed that detail reading this the first time.

> > Having an exception list feels backwards, and since there are only three
> > pending exceptions ever, please just make the callers be explicit and pass
> > the tasks they actually are waiting on:
> > 
> >   allResolvedOrRejected([this._updateManifestTask, this._loadFromCacheTask])
> > 
> > This will simplify the code, make it more explicit and easier to read at the
> > call sites.
> 
> Hm, i could change this, but i'm wary of introducing traps later on. This
> way i just add
> other tasks to the task group and i'm done. With explicit listings i need to
> scan everything
> to see where updates might be needed (and possibly introduce errors).

Hrm. There are traps either way, but I think I'd prefer the explicit form to this one in any case. Not sure if Felipe has other thoughts.

> > >+  /*
> > >+   * Initialize entry from the cache.
> > >+   * @param data The entry data from the cache.
> > >+   * @return boolean Whether initialization succeeded.
> > >+   */
> > >+  initFromCacheData: function (data) {
> > 
> > >+
> > >+    this._randomValue = Math.random();
> > 
> > Don't we have to cache the random value?
> 
> It is cached - the whole entry is getting saved/loaded from/to disk.

I don't see it happening. We initialize _randomValue in two places:

In initFromManifestData() we use this._policy.random()
In initFromCacheData() we use Math.random() (should probably use this._plicy.random())

But in no place do we seem to set that value by reading it from disk. In fact the same thing appears to be true for _startDate, _name, _description, _homepageURL. And what's up with _endDate that only has one reference in the patch?
Flags: needinfo?(benjamin)
Blocks: 983360
Blocks: 974024
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #45)
> > > Don't we have to cache the random value?
> > 
> > It is cached - the whole entry is getting saved/loaded from/to disk.
> 
> I don't see it happening. We initialize _randomValue in two places:

... uh, thanks for the catch. Fixed.
Attached patch Test fixup (obsolete) — Splinter Review
I spent a few minutes working on tests, but I'm not convinced that we should be writing these as xpcshell tests... there is too much surrounding environment such as the addon manager which isn't initialized properly. Anyway, here's what I did to get healthreport stuff working: I had to fake some app-startup notifications and send some missing observer notifications from the xpcshell harness.
Attachment #8390833 - Attachment is patch: true
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #47)
> Created attachment 8390833 [details] [diff] [review]
> Test fixup
> 
> I spent a few minutes working on tests, but I'm not convinced that we should
> be writing these as xpcshell tests... there is too much surrounding
> environment such as the addon manager which isn't initialized properly.
> Anyway, here's what I did to get healthreport stuff working: I had to fake
> some app-startup notifications and send some missing observer notifications
> from the xpcshell harness.

You can test AddonManager from xpcshell tests. It's much easier to test AddonManager from xpcshell tests than mochitests because you don't have to clean up state after the test - process level isolation - yay.

services/healthreport/tests/xpcshell/test_provider_addons.js has some boilerplate to initialize the AddonManager from xpcshell tests. Steal away. FWIW I filed a bug many moons ago for the Addons Manager to expose a test-only JSM to make that more robust and a one-liner.

If your test relies too much on startup and shutdown notifications, that's likely a hint that your code is too tightly coupled with notifications. The recent crash manager work, for example, has only 1 test that involves startup observers. Everything else just instantiates an instance of the "service object" and calls "init()" (which is what the observer handler does). I intentionally write my code to be largely agnostic of startup and shutdown observers to facilitate testing.
Attached patch WIP (obsolete) — Splinter Review
This should address previous comments except async shutdown.
Also fixes the test environment up to pull in the healthreporter and ease setting up fake data.
Attachment #8390534 - Attachment is obsolete: true
Attachment #8390534 - Flags: feedback?(felipc)
Attachment #8390534 - Flags: feedback?(bmcbride)
Attachment #8390950 - Flags: feedback?(felipc)
Attachment #8390950 - Flags: feedback?(bmcbride)
Attached patch Tests for condition evaluation (obsolete) — Splinter Review
Fix-up test_conditions.js.
Attachment #8390535 - Attachment is obsolete: true
Attached patch WIP (obsolete) — Splinter Review
Fixup, tests running properly.
Attachment #8390950 - Attachment is obsolete: true
Attachment #8390950 - Flags: feedback?(felipc)
Attachment #8390950 - Flags: feedback?(bmcbride)
Attachment #8391153 - Flags: feedback?(felipc)
Attachment #8391153 - Flags: feedback?(bmcbride)
Comment on attachment 8390833 [details] [diff] [review]
Test fixup

The latest WIP has both test_fetch.js & test_conditions.js working fine regarding the test environment.
See test_conditions.js for the test FHR payload setup etc.
Attachment #8390833 - Attachment is obsolete: true
Attached patch Tests for condition evaluation (obsolete) — Splinter Review
Fix test shutdown timeout (health reporter was not getting shut down).
Attachment #8390951 - Attachment is obsolete: true
Attached patch WIP (obsolete) — Splinter Review
Attachment #8391153 - Attachment is obsolete: true
Attachment #8391153 - Flags: feedback?(felipc)
Attachment #8391153 - Flags: feedback?(bmcbride)
Attachment #8391352 - Flags: feedback?(felipc)
Attachment #8391352 - Flags: feedback?(bmcbride)
Attached patch Tests 1: conditions (obsolete) — Splinter Review
Attachment #8391176 - Attachment is obsolete: true
Attached patch Tests 3: fetch (obsolete) — Splinter Review
Attached patch Tests 2: activation (obsolete) — Splinter Review
Attachment #8391354 - Attachment description: Tests: conditions → Tests 1: conditions
Attachment #8391357 - Attachment description: Tests: activation → Tests 2: activation
Attachment #8391355 - Attachment description: Tests: fetch → Tests 3: fetch
Attached patch Implement experiments manager (obsolete) — Splinter Review
* r?felipe is primary, r?unfocused for the AddonManager
* The spec is in bug 973997, the internal API spec in Experiments.jsm (the docs format comments)
* This now has basic test-coverage (separate patches)
* Async shutdown should be coming from bsmedberg in a separate patch
* The telemetry payload for the jsfilter (and fixed up sandboxing) is added in bug 983226
* Extended coverage for the condition evaluation (test_conditions.js) is bug 983231
* The telemetry probe TODOs are covered by bug 983360
* The TODO addon.type check depends on the type being added to the AddonManager in bug 973992

I think that's the important notes. I will push this to try tomorrow when i integrated it with the dependencies that are ready by then.
Attachment #8391352 - Attachment is obsolete: true
Attachment #8391352 - Flags: feedback?(felipc)
Attachment #8391352 - Flags: feedback?(bmcbride)
Attachment #8391631 - Flags: review?(felipc)
Attachment #8391631 - Flags: review?(bmcbride)
Attached file Tests 1: conditions (obsolete) —
Attachment #8391354 - Attachment is obsolete: true
Attachment #8391632 - Flags: review?(felipc)
Attached patch Tests 1: conditions (obsolete) — Splinter Review
Attachment #8391632 - Attachment is obsolete: true
Attachment #8391632 - Flags: review?(felipc)
Attachment #8391633 - Flags: review?(felipc)
Attached patch Tests 2: activation (obsolete) — Splinter Review
Attachment #8391357 - Attachment is obsolete: true
Attachment #8391634 - Flags: review?(felipc)
Attached patch Tests 3: Fetch manifest (obsolete) — Splinter Review
Attachment #8391355 - Attachment is obsolete: true
Attachment #8391635 - Flags: review?(felipc)
Attachment #8391636 - Flags: review?(felipc)
Duplicate of this bug: 973999
(In reply to Georg Fritzsche [:gfritzsche] from comment #58)
> Created attachment 8391631 [details] [diff] [review]
> Implement experiments manager
> 
> * r?felipe is primary, r?unfocused for the AddonManager
> * The spec is in bug 973997, the internal API spec in Experiments.jsm (the
> docs format comments)
> * This now has basic test-coverage (separate patches)
> * Async shutdown should be coming from bsmedberg in a separate patch
> * The telemetry payload for the jsfilter (and fixed up sandboxing) is added
> in bug 983226
> * Extended coverage for the condition evaluation (test_conditions.js) is bug
> 983231
> * The telemetry probe TODOs are covered by bug 983360
> * The TODO addon.type check depends on the type being added to the
> AddonManager in bug 973992
> 
> I think that's the important notes. I will push this to try tomorrow when i
> integrated it with the dependencies that are ready by then.

Also note previous review comments in bug 973999. That bug got folded into this one.
Blocks: 975000
Blocks: 977708
Blocks: 979474
Blocks: 973992
Blocks: 984014
Blocks: 983231
No longer depends on: 983231
Attached patch Implement experiments manager (obsolete) — Splinter Review
Fixed up experiment rejection reason propagation.
Attachment #8391631 - Attachment is obsolete: true
Attachment #8391631 - Flags: review?(felipc)
Attachment #8391631 - Flags: review?(bmcbride)
Attachment #8391750 - Flags: review?(felipc)
Attachment #8391750 - Flags: review?(bmcbride)
Attached patch test-1-conditions.patch (obsolete) — Splinter Review
Fixed up experiment rejection reason propagation.
Attachment #8391633 - Attachment is obsolete: true
Attachment #8391633 - Flags: review?(felipc)
Attachment #8391751 - Flags: review?(felipc)
Comment on attachment 8391750 [details] [diff] [review]
Implement experiments manager

Review of attachment 8391750 [details] [diff] [review]:
-----------------------------------------------------------------

I haven't finished reviewing the entire patch yet but I'm already posting what I've got as it might help a head start

::: browser/experiments/Experiments.jsm
@@ +61,5 @@
> +
> +function configureLogging() {
> +  gLogger.level = gPrefs.get(PREF_LOGGING_LEVEL, 50);
> +
> +  if (gPrefs.get(PREF_LOGGING_DUMP, true)) {

do you plan to change the default here to false? (as this pref is not being added in firefox.js)

@@ +71,5 @@
> +  }
> +}
> +
> +// Takes an array of promises and returns a promise that is resolved once all of
> +// them are rejected or resolved.

Promise.jsm has a Promise.all function that does something similar.. Maybe you can use it?
http://mxr.mozilla.org/mozilla-central/source/toolkit/modules/Promise-backend.js#400
It does have a difference in that it rejects as soon as one of the promises is rejected.

@@ +160,5 @@
> +
> +  oneshotTimer: function (target, timeout) {
> +    let timer = Cc["@mozilla.org/timer;1"].createInstance(Ci.nsITimer);
> +    timer.initWithCallback(this, timeout, Ci.nsITimer.TYPE_ONE_SHOT);
> +    return timer;

target arg not used? with that, `this` is the policy object, but the notify function is in the Experiments obj. So it looks like this actually doesn't work?

also, there's a common catch here with nsITimer usage in that the timer might be CC'ed before firing (and thus will never fire) in case you didn't store a reference to it. The usage of this function in this patch is fine but it's worth keeping that in mind that it can be used incorrectly easily. So a better idiom is that the function itself will store the timer in _timer instead of counting on the caller to do that.

@@ +183,5 @@
> +    });
> +  },
> +
> +  telemetryPayload: function () {
> +    return Promise.resolve({});

future feature?

@@ +208,5 @@
> +    updateManifest: null,
> +    loadFromCache: null,
> +    saveToCache: null,
> +    evaluateExperiments: null,
> +  },

the _pendingTasks object being defined in the prototype means that every Experiments.Experiments object will access the same _pendingTasks object, instead of each one accessing its own instance of it.

I assume this is not intentional, and was done as "documentation", right? But you'll actually need to initialize this obj from the constructor to make it correct.

I realize that due to this being a singleton a bug here won't happen at the moment, but it's better to go for correctness and prevent a hard-to-notice problem in the future

@@ +229,5 @@
> +    gPrefsTelemetry.observe(PREF_TELEMETRY_ENABLED, this._telemetryStatusChanged, this);
> +    gPrefsTelemetry.observe(PREF_TELEMETRY_PRERELEASE, this._telemetryStatusChanged, this);
> +
> +    this._experiments = new Map();
> +    this._loadFromCache();

in general it's strange to me that an instance object is changing global settings and using global functions as observers. I know it's because it's a singleton but it makes me a bit uneasy. I don't mind it enough to ask for this to be changed before landing, but it'd be nice if we think about untangling it a bit as a follow-up.

@@ +266,5 @@
> +   */
> +  toggleExperimentsEnabled: function (enabled) {
> +    gLogger.trace("Experiments::toggleExperimentsEnabled(" + enabled + ")");
> +    gPrefs.set(PREF_ENABLED, enabled);
> +  },

unused function?

@@ +315,5 @@
> +  getExperiments: function () {
> +    return Promise.resolve(this._experiments || this._loadFromCache()).then(
> +      () => {
> +        let list = [];
> +        try {

what could throw an exception here? I don't see anything

@@ +340,5 @@
> +            return -1;
> +          } else if (a.endDate < b.endDate) {
> +            return 1;
> +          }
> +          return 0;

can be simplified as `return b.endDate - a.endDate`

Is this meant for the largest endDate to appear first in the list? Otherwise I think you've got the -1, 1 switched..    (and you could simplify with `a.endDate - b.endDate`)

@@ +400,5 @@
> +
> +    return this._pendingTasks.updateManifest;
> +  },
> +
> +  notify: function (timer) {

this._timer should be cleared and null-ref.

You can use CommonUtils.namedTimer to get this + the previous comment for oneshotTimer for free :)

@@ +438,5 @@
> +    }
> +
> +    xhr.onload = function (event) {
> +      if (xhr.status !== 200 && xhr.state !== 0) {
> +        Cu.reportError("Experiments::httpGetRequest::onLoad() - Request to " + url + " returned status " + xhr.status);

gLogger.error instead of Cu.reportError?

@@ +529,5 @@
> +        let result = yield loadJSONAsync(path, { compression: "lz4" });
> +
> +        this._populateFromCache(result);
> +        yield this._evaluateExperiments();
> +      } catch (e if e instanceof OS.File.Error && e.becauseNoSuchFile) {

should the "invalid cache version" be treated as an expected error here too? (which can just be ignored as not having a cache file?)

It can happen on upgrade/downgrade or using the same profile with different versions of FF. I imagine this situation should not cause the _loadFromCache promise to fail which will cause various pendingTasksDone to be rejected

@@ +613,5 @@
> +    return true;
> +  },
> +
> +  _getActiveExperiment: function () {
> +    let enabled = [e for (e of this._experiments) if (e[1].enabled)];

I don't fully understand where these e[1] array accesses come from

@@ +620,5 @@
> +      return enabled[0][1];
> +    }
> +
> +    if (enabled.length > 1) {
> +      gLogger.error("Experiments::getActiveExperimentId() - should not have more than 1 active experiment");

why can only one experiment be active at a time?
(In reply to :Felipe Gomes from comment #69)
> > +  if (gPrefs.get(PREF_LOGGING_DUMP, true)) {
> 
> do you plan to change the default here to false? (as this pref is not being
> added in firefox.js)

Done.

> > +// Takes an array of promises and returns a promise that is resolved once all of
> > +// them are rejected or resolved.
> 
> Promise.jsm has a Promise.all function that does something similar.. Maybe
> you can use it?
> http://mxr.mozilla.org/mozilla-central/source/toolkit/modules/Promise-
> backend.js#400
> It does have a difference in that it rejects as soon as one of the promises
> is rejected.

I looked at Promise.all(), but i needed a utility that waits on all promises being rejected or resolved to serialize access to the experiment map and files.

> 
> @@ +160,5 @@
> > +
> > +  oneshotTimer: function (target, timeout) {
> > +    let timer = Cc["@mozilla.org/timer;1"].createInstance(Ci.nsITimer);
> > +    timer.initWithCallback(this, timeout, Ci.nsITimer.TYPE_ONE_SHOT);
> > +    return timer;
> 
> target arg not used? with that, `this` is the policy object, but the notify
> function is in the Experiments obj. So it looks like this actually doesn't
> work?

Oh, fixing. Looks like i didn't run this outside of xpcshell tests again after moving that there recently.

> also, there's a common catch here with nsITimer usage in that the timer
> might be CC'ed before firing (and thus will never fire) in case you didn't
> store a reference to it. The usage of this function in this patch is fine
> but it's worth keeping that in mind that it can be used incorrectly easily.
> So a better idiom is that the function itself will store the timer in _timer
> instead of counting on the caller to do that.

The calling and callee instance are different (Experiments.Experiments vs. Experiments.Policy).

> 
> @@ +183,5 @@
> > +    });
> > +  },
> > +
> > +  telemetryPayload: function () {
> > +    return Promise.resolve({});
> 
> future feature?

This is in bug 983226 (broken out due to time constraints).

> the _pendingTasks object being defined in the prototype means that every
> Experiments.Experiments object will access the same _pendingTasks object,
> instead of each one accessing its own instance of it.

Thanks, fixing.

> 
> @@ +229,5 @@
> > +    gPrefsTelemetry.observe(PREF_TELEMETRY_ENABLED, this._telemetryStatusChanged, this);
> > +    gPrefsTelemetry.observe(PREF_TELEMETRY_PRERELEASE, this._telemetryStatusChanged, this);
> > +
> > +    this._experiments = new Map();
> > +    this._loadFromCache();
> 
> in general it's strange to me that an instance object is changing global
> settings and using global functions as observers. I know it's because it's a
> singleton but it makes me a bit uneasy. I don't mind it enough to ask for
> this to be changed before landing, but it'd be nice if we think about
> untangling it a bit as a follow-up.

Hm, the only global observer function i see is configureLogging, which seems fine (no need to configure logging per-instance).
I think all the global settings actually would also make sense to be shared across instances. But maybe i'm missing something?

> 
> @@ +266,5 @@
> > +   */
> > +  toggleExperimentsEnabled: function (enabled) {
> > +    gLogger.trace("Experiments::toggleExperimentsEnabled(" + enabled + ")");
> > +    gPrefs.set(PREF_ENABLED, enabled);
> > +  },
> 
> unused function?

It's part of the public API (all the functions with doxygen style comments are), to be used by the the AddonManager UX for this, about:support, ...

> 
> @@ +315,5 @@
> > +  getExperiments: function () {
> > +    return Promise.resolve(this._experiments || this._loadFromCache()).then(
> > +      () => {
> > +        let list = [];
> > +        try {
> 
> what could throw an exception here? I don't see anything

Fixed, probably left-over.

> 
> @@ +340,5 @@
> > +            return -1;
> > +          } else if (a.endDate < b.endDate) {
> > +            return 1;
> > +          }
> > +          return 0;
> 
> can be simplified as `return b.endDate - a.endDate`

Ah, yeah, that looks much nicer. 

> Is this meant for the largest endDate to appear first in the list?

Yes, it's supposed to have the most recent experiment first (i.e. largest endDate).

> You can use CommonUtils.namedTimer to get this + the previous comment for
> oneshotTimer for free :)

Oh, let me take a look.

> @@ +529,5 @@
> > +        let result = yield loadJSONAsync(path, { compression: "lz4" });
> > +
> > +        this._populateFromCache(result);
> > +        yield this._evaluateExperiments();
> > +      } catch (e if e instanceof OS.File.Error && e.becauseNoSuchFile) {
> 
> should the "invalid cache version" be treated as an expected error here too?
> (which can just be ignored as not having a cache file?)
> 
> It can happen on upgrade/downgrade or using the same profile with different
> versions of FF. I imagine this situation should not cause the _loadFromCache
> promise to fail which will cause various pendingTasksDone to be rejected

Good point, fixing.

> @@ +613,5 @@
> > +    return true;
> > +  },
> > +
> > +  _getActiveExperiment: function () {
> > +    let enabled = [e for (e of this._experiments) if (e[1].enabled)];
> 
> I don't fully understand where these e[1] array accesses come from

this._experiments is a Map (id -> ExperimentEntry) & and over Map iteration is by |[key,value]|.

> 
> @@ +620,5 @@
> > +      return enabled[0][1];
> > +    }
> > +
> > +    if (enabled.length > 1) {
> > +      gLogger.error("Experiments::getActiveExperimentId() - should not have more than 1 active experiment");
> 
> why can only one experiment be active at a time?

By design & spec we don't more than one active at a time (e.g. think possibly interactions between multiple experiments).
(In reply to Georg Fritzsche [:gfritzsche] from comment #70)
> > You can use CommonUtils.namedTimer to get this + the previous comment for
> > oneshotTimer for free :)

Hm, it lives in services/common/utils.js, this lives in browser/experiments/ - can we actually depend on this?
(In reply to Georg Fritzsche [:gfritzsche] from comment #71)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #70)
> > > You can use CommonUtils.namedTimer to get this + the previous comment for
> > > oneshotTimer for free :)
> 
> Hm, it lives in services/common/utils.js, this lives in browser/experiments/
> - can we actually depend on this?

For desktop, yes. See also toolkit/modules/Timer.jsm.
Comment on attachment 8391750 [details] [diff] [review]
Implement experiments manager

Review of attachment 8391750 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/experiments/Experiments.jsm
@@ +42,5 @@
> +const PREF_HEALTHREPORT_ENABLED = "datareporting.healthreport.service.enabled";
> +
> +const PREF_BRANCH_TELEMETRY     = "toolkit.telemetry.";
> +const PREF_TELEMETRY_ENABLED    = "enabled";
> +const PREF_TELEMETRY_PRERELEASE = "enabledPreRelease";

Use TelemetryPing.Constants.PREF_ENABLED and keep this leaky implementation detail at bay!

@@ +47,5 @@
> +
> +const gPrefs = new Preferences(PREF_BRANCH);
> +const gPrefsTelemetry = new Preferences(PREF_BRANCH_TELEMETRY);
> +let gExperimentsEnabled = false;
> +let gTelemetryEnabled = false;

I wouldn't bother tracking this in a variable - just look it up from prefs (prefs are fast to consult since they are essentially a giant in-memory hash map).

Having explicit variables just adds one more thing to keep in sync and something that can break. KISS.

@@ +51,5 @@
> +let gTelemetryEnabled = false;
> +let gExperiments = null;
> +let gLogAppenderDump = null;
> +
> +XPCOMUtils.defineLazyGetter(this, "gLogger", function() {

Why's this a lazy getter? Why not use loggers on each object?

@@ +98,5 @@
> +// OS.File.read.
> +// Returns a Promise resolved with the json payload or rejected with
> +// OS.File.Error or JSON.parse() errors.
> +function loadJSONAsync(file, options) {
> +  return Task.spawn(function() {

function* for all generator functions.

(As written is technically in violation of the spec. There are plans somewhere to make SpiderMonkey conforming and mass-cleanup all existing generator functions to the new syntax.)

@@ +116,5 @@
> +}
> +
> +function telemetryEnabled() {
> +  return gPrefsTelemetry.get(PREF_TELEMETRY_ENABLED, false) ||
> +         gPrefsTelemetry.get(PREF_TELEMETRY_PRERELEASE, false);

Replace with TelemetryPing.Constants.PREF_ENABLED only.

@@ +178,5 @@
> +            .wrappedJSObject
> +            .healthReporter;
> +      yield reporter.onInit();
> +      let payload = yield reporter.collectAndObtainJSONPayload();
> +      throw new Task.Result(payload);

You don't need to throw Task.Result() any more. New-style generator functions allow you to simply return a value. Unlike other languages (like Python), you can mix yield and return in the same function.

@@ +192,5 @@
> + * Manages the experiments and provides an interface to control them.
> + */
> +
> +Experiments.Experiments = function (policy) {
> +  this._policy = policy || new Experiments.Policy();

I believe you can write this as:

function (policy=new Experiments.Policy())

I don't believe JS will instantiate the new object unless no parameter was passed to the function.

@@ +372,5 @@
> +    }
> +
> +    let uri = Services.urlFormatter.formatURLPref(PREF_BRANCH + PREF_MANIFEST_URI);
> +
> +    this._pendingTasks.updateManifest = Task.spawn(function () {

function*

::: browser/experiments/test/xpcshell/head.js
@@ +18,5 @@
> +
> +const EXPERIMENT2_ID       = "test-experiment-2@tests.mozilla.org"
> +const EXPERIMENT2_XPI_SHA1 = "sha1:81877991ec70360fb48db84c34a9b2da7aa41d6a";
> +const EXPERIMENT2_XPI_NAME = "experiment-2.xpi";
> +

Just so you know, I'm on an informal crusade to kill head.js and tail.js. Well, at least the parts that define common code.

It's much preferred to write a testing-only JSM that can be imported by each test. The reason is to promote code re-usability across tests from multiple components. Your use of getHealthReporter() below and the comparative hack to import the Addons Manager via some cryptic 10-line JavaScript function clearly contrast the differences between both methods.

This *is* relevant because we'll likely need to write some AddonsManager tests that verify old experiments are treated properly. Those tests will (likely) live under toolkit/mozapps/extensions (because their testing framework is quite involved and nearly impossible to instantiate from other components) and those tests will need to perform some of the same initialization and monkeypatching done here. Having the code in a reusable JSM makes this much, much easier. Lowering the barrier to writing tests is good!

@@ +110,5 @@
> +  };
> +
> +  let registrar = Components.manager.QueryInterface(Ci.nsIComponentRegistrar);
> +  registrar.registerFactory(XULAPPINFO_CID, "XULAppInfo",
> +                            XULAPPINFO_CONTRACTID, XULAppInfoFactory);

testing/modules/AppInfo.jsm contains this boilerplate.

Cu.import("resource://testing-common/AppInfo.jsm");
Comment on attachment 8391750 [details] [diff] [review]
Implement experiments manager

Review of attachment 8391750 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/experiments/Experiments.jsm
@@ +307,5 @@
> +   *   active: <boolean>,
> +   *   endDate: <integer>, // epoch ms
> +   *   detailURL: <string>,
> +   *   ... // possibly extended later
> +   * }

The Add-on Manager API requires the following properties for previous experiments (if we want to make the add-on manager hacking minimal):

* author
* id
* name

Not sure if we plan on storing an accurate author in the metadata of each experiment. We can always insert a dummy value (e.g. "Mozilla") when these old experiments get translated to Add-on Manager "Addon" instances.
id & name are present for the historical entries. We don't need name for the current spec (and there is only one source for pusing experiments anyway), so let's just use a dummy.
(In reply to Gregory Szorc [:gps] from comment #73)
> ::: browser/experiments/test/xpcshell/head.js
> @@ +18,5 @@
> > +
> > +const EXPERIMENT2_ID       = "test-experiment-2@tests.mozilla.org"
> > +const EXPERIMENT2_XPI_SHA1 = "sha1:81877991ec70360fb48db84c34a9b2da7aa41d6a";
> > +const EXPERIMENT2_XPI_NAME = "experiment-2.xpi";
> > +
> 
> Just so you know, I'm on an informal crusade to kill head.js and tail.js.
> Well, at least the parts that define common code.
> 
> It's much preferred to write a testing-only JSM that can be imported by each
> test. The reason is to promote code re-usability across tests from multiple
> components. Your use of getHealthReporter() below and the comparative hack
> to import the Addons Manager via some cryptic 10-line JavaScript function
> clearly contrast the differences between both methods.
> 
> This *is* relevant because we'll likely need to write some AddonsManager
> tests that verify old experiments are treated properly. Those tests will
> (likely) live under toolkit/mozapps/extensions (because their testing
> framework is quite involved and nearly impossible to instantiate from other
> components) and those tests will need to perform some of the same
> initialization and monkeypatching done here. Having the code in a reusable
> JSM makes this much, much easier. Lowering the barrier to writing tests is
> good!

I definitely see the benefits of this after the patches here. Let's just address this when we get there with testing?
(In reply to Georg Fritzsche [:gfritzsche] from comment #70)
> > @@ +266,5 @@
> > > +   */
> > > +  toggleExperimentsEnabled: function (enabled) {
> > > +    gLogger.trace("Experiments::toggleExperimentsEnabled(" + enabled + ")");
> > > +    gPrefs.set(PREF_ENABLED, enabled);
> > > +  },
> > 
> > unused function?

ok. as there's a getter for |enabled|, this would be nicer as a setter instead of a different function

> > @@ +613,5 @@
> > > +    return true;
> > > +  },
> > > +
> > > +  _getActiveExperiment: function () {
> > > +    let enabled = [e for (e of this._experiments) if (e[1].enabled)];
> > 
> > I don't fully understand where these e[1] array accesses come from
> 
> this._experiments is a Map (id -> ExperimentEntry) & and over Map iteration
> is by |[key,value]|.

ok cool, in that case you can name the two variables on the iterator to make that more clear. like:
[experiment for ([, experiment] of this._experiments) if (experiment.enabled)];


I'm just curious about the timer and scheduler and what will happen if a session is not long enough to run an experiment.. I don't know what range of time we're talking here: seconds, minutes, hours?.. I just would like to mention there's the update interval service to deal with this type of thing across sessions, if the interval is big enough for it to be a concern
Comment on attachment 8391750 [details] [diff] [review]
Implement experiments manager

Review of attachment 8391750 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/experiments/Experiments.jsm
@@ +918,5 @@
> +    });
> +
> +    this.DATE_KEYS.forEach(key => {
> +      obj[key] = this[key];
> +    });

why the split here in SERIALIZE_KEYS and DATE_KEYS?
(In reply to :Felipe Gomes from comment #78)
> ::: browser/experiments/Experiments.jsm
> @@ +918,5 @@
> > +    });
> > +
> > +    this.DATE_KEYS.forEach(key => {
> > +      obj[key] = this[key];
> > +    });
> 
> why the split here in SERIALIZE_KEYS and DATE_KEYS?

To special case on dates to serialize them as epoch-ms.
(In reply to :Felipe Gomes from comment #77)
> I'm just curious about the timer and scheduler and what will happen if a
> session is not long enough to run an experiment.. I don't know what range of
> time we're talking here: seconds, minutes, hours?.. I just would like to
> mention there's the update interval service to deal with this type of thing
> across sessions, if the interval is big enough for it to be a concern

I don't think we've had specific time frames thrown around, but we have to address them being active over sessions anyway.
We cover both cross-session and in-session intervals by scheduling a timer (Experiments._timer) that triggers experiment re-evaluation (coming up fixed in a bit in a new revision here).
Comment on attachment 8391750 [details] [diff] [review]
Implement experiments manager

Review of attachment 8391750 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/experiments/Experiments.jsm
@@ +1120,5 @@
> +          //}
> +
> +          let addon = install.addon;
> +          this._name = addon.name;
> +          this._description = addon.description || "";

Should make these getters that always return the live value from the Addon object - there are cases when the value gets updated during runtime. Not sure if we'll do that for experiments (involves a metadata ping to AMO, especially useful for non en-US locales), but worth it just in case.

@@ +1171,5 @@
> +        deferred.reject(new Error(message));
> +        return;
> +      }
> +
> +      addon.uninstall();

While in practice this a synchronous operation at the moment, the API doesn't guarantee it (and it should become async in the future). You should be waiting for the onUninstalled event to fire before resolving the promise.

@@ +1178,5 @@
> +      this._lastChangedDate = now;
> +      this._endDate = now;
> +      this._enabled = false;
> +
> +      // TODO telemetry: experiment disabling, differentiate by userDisabled

Waiting to see this implemented.
Attachment #8391750 - Flags: review?(bmcbride) → review-
Comment on attachment 8391750 [details] [diff] [review]
Implement experiments manager

Review of attachment 8391750 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/experiments/Experiments.jsm
@@ +36,5 @@
> +const PREF_ENABLED              = "enabled";
> +const PREF_LOGGING              = "logging";
> +const PREF_LOGGING_LEVEL        = PREF_LOGGING + ".level";
> +const PREF_LOGGING_DUMP         = PREF_LOGGING + ".dump";
> +const PREF_MANIFEST_URI         = "manifest.uri";

this is just extremely OCD, but since pref names are being concat'ed together and accessed through a pref branch, can you add to the end of each line a comment with the full name of the pref? This makes it possible to search for them in the code
(In reply to Blair McBride [:Unfocused] from comment #81)
> ::: browser/experiments/Experiments.jsm
> @@ +1120,5 @@
> > +          //}
> > +
> > +          let addon = install.addon;
> > +          this._name = addon.name;
> > +          this._description = addon.description || "";
> 
> Should make these getters that always return the live value from the Addon
> object - there are cases when the value gets updated during runtime. Not
> sure if we'll do that for experiments (involves a metadata ping to AMO,
> especially useful for non en-US locales), but worth it just in case.

We need that data for the active experiment and all historical experiments, hence we need to cache it.
Attached patch Implement experiments manager (obsolete) — Splinter Review
Attachment #8391750 - Attachment is obsolete: true
Attachment #8391750 - Flags: review?(felipc)
Attachment #8392028 - Flags: review?(felipc)
Attachment #8391751 - Attachment is obsolete: true
Attachment #8391751 - Flags: review?(felipc)
Attachment #8392029 - Flags: review?(felipc)
Attachment #8391634 - Attachment is obsolete: true
Attachment #8391634 - Flags: review?(felipc)
Attachment #8392030 - Flags: review?(felipc)
Attachment #8391635 - Attachment is obsolete: true
Attachment #8391635 - Flags: review?(felipc)
Attachment #8392031 - Flags: review?(felipc)
Attachment #8391636 - Attachment is obsolete: true
Attachment #8391636 - Flags: review?(felipc)
Attachment #8392032 - Flags: review?(felipc)
Patch queue of the currently working patches pushed:
https://tbpl.mozilla.org/?tree=Try&rev=7bd9f9c3cb40
Comment on attachment 8392028 [details] [diff] [review]
Implement experiments manager

Review of attachment 8392028 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/experiments/Experiments.jsm
@@ +212,5 @@
> +  this.init();
> +};
> +
> +Experiments.Experiments.prototype = {
> +  QueryInterface: XPCOMUtils.generateQI([Ci.nsITimerCallback, Ci.nsIObserver]),

no need to implement nsITimerCallback anymore as namedTimer handles that
Attachment #8392028 - Flags: review?(felipc) → review+
Attachment #8392029 - Flags: review?(felipc) → review+
Attachment #8392030 - Flags: review?(felipc) → review+
Attachment #8392031 - Flags: review?(felipc) → review+
Comment on attachment 8392032 [details] [diff] [review]
Tests 4 - API, disabling & higher-level coverage

nit: xpcshell.ini has a \no newline at the end of file introduced in one of the tests patch.
Attachment #8392032 - Flags: review?(felipc) → review+
Attachment #8392028 - Flags: review?(bmcbride)
Comment on attachment 8392028 [details] [diff] [review]
Implement experiments manager

Review of attachment 8392028 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/experiments/Experiments.jsm
@@ +1076,5 @@
> +
> +    return Task.spawn(function ExperimentEntry_runFilterFunction_task() {
> +      const nullprincipal = Cc["@mozilla.org/nullprincipal;1"].createInstance(Ci.nsIPrincipal);
> +      let options = {
> +        sandboxName: "telemetry experiments jsfilter sandbox",

sandboxName should ideally be unique so we can better track it - so it should include the experiment ID (I think convention is to put it in brackets at the end).

@@ +1203,5 @@
> +        return;
> +      }
> +
> +      let listener = {};
> +      listener.onUninstalled = addon => {

So, just had a thought: What happens if some other code uninstalls one of these add-ons when you're not expecting it? Seems like it would, uh, break your world. IMO, you do need to protect against this type of issue when dealing with the Add-ons Manager.

I think on startup you want to register an addon listener. And in here, tell your listener that you're expecting an uninstall - so the listener can resolve this promise when it sees that uninstall.

Thinking about that more, that listener should also handle the case where the add-on is disabled too (onDisabled) - so you can respond with uninstalling the add-on. You'll need that anyway, so that when someone clicks the Disable button in the UI, it does what you want it to do (ie, uninstall).
Attachment #8392028 - Flags: review?(bmcbride) → review-
(In reply to Blair McBride [:Unfocused] from comment #92)
> Thinking about that more, that listener should also handle the case where
> the add-on is disabled too (onDisabled) - so you can respond with
> uninstalling the add-on. You'll need that anyway, so that when someone
> clicks the Disable button in the UI, it does what you want it to do (ie,
> uninstall).

The AddonManager experiments UI should disable experiments use the public API provided here.
(In reply to Blair McBride [:Unfocused] from comment #92)
> I think on startup you want to register an addon listener. And in here, tell
> your listener that you're expecting an uninstall

That sounds a little more interdependent than i'd like. We can have both a specialized listener in stop() and a common one though.
Attachment #8392028 - Attachment is obsolete: true
Attachment #8392171 - Flags: review?(bmcbride)
The try run above is all fine, so that patch queue should be good to go pending r?unfocused.
Comment on attachment 8392171 [details] [diff] [review]
Implement experiments manager

Review of attachment 8392171 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Georg Fritzsche [:gfritzsche] from comment #93)
> The AddonManager experiments UI should disable experiments use the public
> API provided here.

But it can't :) That's not how the Add-ons Manager works. The Add-ons Manager UI is built to be generic in how it handles different add-on types. The patches in bug 973992 that result in the experiment add-ons showing up in the Add-ons Manager UI actually barely touchs the UI code - most of it is/will be dealing with backend support that's needed anyway, and the UI just automatically does everything for us.

But, that global listener handles that fine now anyway.

(In reply to Georg Fritzsche [:gfritzsche] from comment #94)
> We can have both a
> specialized listener in stop() and a common one though.

Hmm, ok, I think the way you've done that seems to work for the various edge cases.

::: browser/experiments/Experiments.jsm
@@ +1221,5 @@
> +    AddonManager.getAddonByID(this.id, addon => {
> +      if (!addon) {
> +        let message = "could not get Addon for " + this.id;
> +        gLogger.error("ExperimentEntry::stop() - " + message);
> +        deferred.reject(new Error(message));

Hm, this isn't actually an error - at least, not one that should prevent anything else from working (ie, promise should just be immediately resolved here). It just means the add-on was already uninstalled via some other means.
Attachment #8392171 - Flags: review?(bmcbride) → review+
Blocks: 984879
Blocks: 985179
Depends on: 973997
Was about to file this as a followup bug, but not sure if it's bugworthy or where the right place to discuss potential features. ;)

I've run into an issue with Test Pilot where I wanted to deploy to say 1% of en-US but 10% of zh-CN because the audience sizes are so different. Would the way to get this different sampling split be through creating two experiments that happen to point to the same xpiURL, etc?
Yes, we'd do it through separate experiment entries, each targeting a non-overlapping audience. Alternatively, we could shoehorn something into filter functions to facilitate this.
Depends on: 1028612
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.