Closed Bug 973997 Opened 11 years ago Closed 11 years ago

Telemetry experiments: decide on manifest url scheme and format

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED
Firefox 31

People

(Reporter: benjamin, Assigned: gps)

References

Details

Attachments

(1 file, 3 obsolete files)

Need decisions on the manifest format and URL scheme for the client to request the manifest from the server. Probably only want very basic metadata in the URL (product, version, channel, os): let the conditions in the manifest itself do most of the limiting. For starters we probably want just one manifest file for everything and do everything with conditions, but we want to be flexible for the future.
Blocks: 973998
Blocks: 973999
See also bug 850491 for context regarding URL decisions. tl;dr use many URLs to facilitate Ops-y things and to facilitate future changes.
(Basic version here; maybe add locale / lang / geo?) (Locale is fine!)
I need to add some bits about URL fetching and make the language more consistent. But I think this is a good start and at a point where feedback would be useful. I cobbled together conditions from everywhere I could find. What appears here is a union of what's out there plus some "common sense" things like kill switches that I threw in.
Attachment #8382660 - Flags: feedback?(georg.fritzsche)
Attachment #8382660 - Flags: feedback?(benjamin)
Major thing missing: what the context passed to the js filter functions looks like. That should definitely be defined somewhere. We may also want to attach enumerated values where possible, as I imagine experiment authors will want to consult an authoritative list.
Comment on attachment 8382660 [details] [diff] [review]
Define and document Experiments Manifests

Review of attachment 8382660 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, this looks pretty good to me. Remarks/Questions following.

::: browser/experiments/docs/manifest.rst
@@ +41,5 @@
> +
> +id
> +  (required) String identifier of this experiment. The identifier should
> +  be treated as opaque by clients. It is used to uniquely identify an
> +  experiment for all of time.

If they get unique ids, can't we just use a map instead of an array?

@@ +64,5 @@
> +
> +startTime
> +  (optional) Integer seconds since UNIX epoch that this experiment should
> +  start. Clients should not start an experiment if *now()* is less than
> +  this value.

What about clients where |startTime| is still in the future? Do they need to schedule activation for |startTime| or is it enough to activate them on some periodic schedule? Same for |endTime| and |maxActiveSeconds|.

@@ +227,5 @@
> +  The client must remember experiments that have been tested for
> +  probability-based applicability and it must not reconsider experiments
> +  for which it has previously failed to join due to this field. The
> +  exception is if the probability value changes. In that case, the client
> +  may re-evaluate its applicability into an experiment.

Does it make sense to track the probability client-side? Couldn't evaluation be triggered instead by something in the manifest?

@@ +262,5 @@
> +
> +  The string is expected to contain the definition of a JavaScript function
> +  ``filter(context)``. This function receives as its argument an object
> +  describing the client's state. The function's return value is evaluated
> +  in an ``if`` statement to determine whether the experiment is applicable.

if-statement might be a little more specific than needed? How about a boolean context or something like that?

@@ +297,5 @@
> +If a specific experiment disappears from the manifest, the client should
> +continue conducting an already-active experiment. If an inactive but
> +scheduled-to-be-active experiment disappears from the manifest, the
> +client should not activate the experiment. If the experiment reappears
> +in the manifest, the client may add it again.

Does the client still need to honor things like |probability| changes when an experiment re-appears? I think this will decide on whether we have to keep track of removed experiments forever.
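For concreteness, a minimal sketch of what a version-1 manifest with a single experiment might look like, using the field names quoted above (``xpiURL`` and ``endTime`` are assumed from the surrounding discussion; all values are illustrative, not final):

{
  "version": 1,
  "experiments": [
    {
      "id": "pdf-js-rollout-1",
      "xpiURL": "https://example.invalid/experiments/pdf-js-rollout-1.xpi",
      "xpiSHA1": "f677428b9172e22e9911039aef03f3736e7f78a7",
      "startTime": 1393804800,
      "endTime": 1396483200,
      "maxActiveSeconds": 604800,
      "product": ["firefox"],
      "os": ["WINNT", "Darwin"],
      "locale": ["en-US"],
      "probability": 0.5,
      "jsfilter": "function filter(context) { return true; }"
    }
  ]
}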
Attachment #8382660 - Flags: feedback?(georg.fritzsche) → feedback+
Attachment #8382660 - Flags: feedback?(glind)
Comment on attachment 8382660 [details] [diff] [review]
Define and document Experiments Manifests

>+Manifest Format
>+===============
>+
>+Manifests are JSON documents where the main element is an object.
>+
>+The *schema* of the object is versioned and defined by the presence
>+of a top-level ``version`` property, whose integer value is the
>+schema version used by that manifest. Each version is documented
>+in the sections below.

What should a client do if they receive a version that they don't recognize?

>+xpiSHA1
>+  (required) String SHA-1 (40 digit hexidecimal form) of the XPI for
>+  this experiment.
>+
>+  When the client downloads the XPI for the experiment, it should compare
>+  the SHA-1 of that XPI against this value. If the hashes don't match,
>+  the client should not install the XPI.
>+
>+  Clients may also use this SHA-1 as a means of determining when an
>+  experiment's XPI has changed and should be refreshed. (TODO is this
>+  necessary or do we get this for free with the add-on manager?)

We have prior art for hashes that we should reuse. In particular, instead of specifying the hash format in the key name, it is typically specified in the value:

<pfs:installerHash>sha256:abasdfr....</pfs:installerHash>

We should use that convention here.

>+startTime
>+  (optional) Integer seconds since UNIX epoch that this experiment should
>+  start. Clients should not start an experiment if *now()* is less than
>+  this value.

This should not be optional.

>+maxStartTime
>+  (optional) Integer seconds since UNIX epoch after which this experiment
>+  should no longer start.

This should not be optional.

>+maxActiveSeconds
>+  (optional) Integer seconds defining the max wall time this experiment
>+  should be active for.

This should not be optional.

>+maxSessions
>+  (optional) Integer number of application sessions this experiment should be
>+  active for.
>+
>+  If this is defined, the client should deactivate the experiment after this
>+  many application sessions (typically process lifetimes) have surpassed.
>+
>+  If this is not specified, the duration of the experiment is not limited by
>+  the number of sessions.

This isn't required for v1 so let's not include or implement this for now.

>+product
>+  (optional) Array of product identifiers this experiment should run on.
>+
>+  A product identifier is something like ``desktop`` or ``fennec``.
>+
>+  The client should compare its product name to members of this array. If
>+  a match is found, the experiment is applicable.

``desktop``? Where does this come from? Shouldn't we either use the app name or the app ID from nsIXULAppInfo for this?

>+os
>+  (optional) Array of operating system identifiers this experiment should
>+  run on.
>+
>+  The client will compare its operating system identifier to members
>+  of this array. If a match is found, the experiment is applicable to the
>+  client.

Is this from nsIXULRuntime.OS?

>+language
>+  (optional) Array of language identifiers this experiment should run on.
>+
>+  A language identifier is a string like ``en`` or ``de``.
>+
>+  The client should compare its language identifier to members of this array.
>+  If a match is found, the experiment is applicable.
>+
>+  If this property is not defined, the client should assume the experiment
>+  is to run on all languages.
>+
>+locale
>+  (optional) Array of locale identifiers this experiment should run on.
>+
>+  A locale identifier is a string like ``en-US`` or ``zh-CN``.
>+
>+  The client should compare its locale identifier to members of this array.
>+  If a match is found, the experiment is applicable.
>+
>+  If this property is not defined, the client should assume the experiment
>+  is to run on all locales.

What is the difference between language and locale in this example, and what are we using to compute the client locale? If the user installs and activates a language pack, will the locale be the originally installed locale or the language pack locale?

>+probability
>+  (optional) Decimal number indicating the sampling rate for this experiment.
>+
>+  This will contain a value between ``0.0`` and ``1.0``. The client should
>+  generate a random decimal between ``0.0`` and ``1.0``. If the randomly
>+  generated number is less than or equal to the value of this field, the
>+  experiment is applicable.
>+
>+  The client must remember experiments that have been tested for
>+  probability-based applicability and it must not reconsider experiments
>+  for which it has previously failed to join due to this field. The
>+  exception is if the probability value changes. In that case, the client
>+  may re-evaluate its applicability into an experiment.

Can we call this "sample" instead of probability? And we also need to clarify whether it is calculated before or after the other conditions including the jsfilter. It seems to me that after makes more sense.

We should record the sample rate and result the first time we compute it. Then, if the sample rate changes, we can compute the rate difference and apply a new randomization. So for instance:

Example A:
Start sample: 10% - we hit NO
Later sample: 20% - we do another 10% sample

Example B:
Start sample: 10% - we hit NO
Later sample: 5% - automatically rejected

Example C:
Start sample: 10% but we don't match the other conditions
We don't compute anything and wait for a future world where the conditions do match.

>+jsfilter
>+  (optional) JavaScript code that will be evaluated to determine experiment
>+  applicability.
>+
>+  This property contains the string representation of JavaScript code that
>+  will be evaluated in a sandboxed environment using JavaScript's
>+  ``eval()``.
>+
>+  The string is expected to contain the definition of a JavaScript function
>+  ``filter(context)``. This function receives as its argument an object
>+  describing the client's state. The function's return value is evaluated
>+  in an ``if`` statement to determine whether the experiment is applicable.
>+
>+  The purpose of this property is to allow experiments to define complex
>+  rules and logic for evaluating experiment applicability in a manner
>+  that is privacy concious and doesn't require the transmission of
>+  excessive data.
>+
>+  If this property is not defined, the client does not consider a custom
>+  JavaScript filter function when determining whether an experiment is
>+  applicable.

nit spelling "conscious".

One of Gregg's requests is that the filter function be able to express some details about why it failed. I think the best way to do this is by throwing an exception. So for example a function might do:

function filter(context) {
  var fhr = context.getHealthData();

  // reject people who can't run our DLL
  if (fhr.geckoAppInfo.xpcomabi != "x86_64-gcc3") {
    throw context.Rejection("Incorrect xpcomabi: " +
                            fhr.geckoAppInfo.xpcomabi);
  }

  // reject users with conflicting extensions
  var addons = fhr.geckoAppInfo.data.last["org.mozilla.addons.active"];
  var inspector = addons["inspector@mozilla.org"];
  if (inspector !== undefined && !inspector.userDisabled &&
      !inspector.appDisabled) {
    throw context.Rejection("DOMI active");
  }

  return true;
}

Then when we are logging in telemetry why an experiment was not enabled, we can include the Rejection data. The context data we should supply is:

* the FHR payload
* the current telemetry payload
* the Rejection function or something equivalent for returning reasons from the function

Also we should document that if the function throws an error (Rejection or otherwise) the experiment will not run (and we log the exception to telemetry).

> If they get unique ids, can't we just use a map instead of an array?

I like the array because it expresses a form of priority. In particular since we'll only be enabling one experiment at a time, we'll always check the first experiment before moving on to the next.

> What about clients where |startTime| is still in the future? Do they need to schedule activation for |startTime| or is it enough to activate them on some periodic schedule? Same for |endTime| and |maxActiveSeconds|.

Good question. I think that for now the once-a-day timer would be sufficient, but if we can provide more precision easily feel free to do that.

> I think this will decide on whether we have to keep track of removed experiments forever.

Indeed, I didn't think about this too carefully. I think we should keep the log of experiments/probabilities for the same 180 day window as the rest of FHR data.

There is one case that comes to mind which is not addressed by the spec: what should we do if the conditions change once an experiment has already been installed? We have a couple options:

A. once an experiment is installed, the conditions are irrelevant
B. honor changes to all of the conditions except the sample rate
C. honor changes to all of the conditions including the sample rate

I think for v1 (A) would be fine but (B) would be better long-term. So feel free to either implement B now or punt it to a followup bug.
Attachment #8382660 - Flags: feedback?(benjamin)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #6)
> What should a client do if they receive a version that they don't recognize?

Good question. It's not yet documented, but I'd like the fetch URLs to contain the client type, channel and version number of the client. That way, we can maintain a server-side mapping of client support to manifest version so we never release an unsupported manifest version to an incompatible client. OK, "never" is really "rarely" and that case should be limited to "Nightly" when new manifest versions are implemented.

> >+maxSessions
> >+  (optional) Integer number of application sessions this experiment should be
> >+  active for.
> >+
> >+  If this is defined, the client should deactivate the experiment after this
> >+  many application sessions (typically process lifetimes) have surpassed.
> >+
> >+  If this is not specified, the duration of the experiment is not limited by
> >+  the number of sessions.
>
> This isn't required for v1 so let's not include or implement this for now.

My thinking with specifications and versioning is that the fewer times you version something the better. I erred on the side of being more inclusive. It's not like we're writing super formal specifications that need to interop - I think it would be fine if we defined these things but just didn't implement them on the client nor used them in the wild until clients supported them.

> >+product
> >+  (optional) Array of product identifiers this experiment should run on.
> >+
> >+  A product identifier is something like ``desktop`` or ``fennec``.
> >+
> >+  The client should compare its product name to members of this array. If
> >+  a match is found, the experiment is applicable.
>
> ``desktop``? Where does this come from? Shouldn't we either use the app name
> or the app ID from nsIXULAppInfo for this?

I just put in a placeholder. I think either nsIXULAppInfo field makes sense. We should of course document this.

> >+os
> >+  (optional) Array of operating system identifiers this experiment should
> >+  run on.
> >+
> >+  The client will compare its operating system identifier to members
> >+  of this array. If a match is found, the experiment is applicable to the
> >+  client.
>
> Is this from nsIXULRuntime.OS?

Presumably?

> >+language
> >+  (optional) Array of language identifiers this experiment should run on.
> >+
> >+  A language identifier is a string like ``en`` or ``de``.
> >+
> >+  The client should compare its language identifier to members of this array.
> >+  If a match is found, the experiment is applicable.
> >+
> >+  If this property is not defined, the client should assume the experiment
> >+  is to run on all languages.
> >+
> >+locale
> >+  (optional) Array of locale identifiers this experiment should run on.
> >+
> >+  A locale identifier is a string like ``en-US`` or ``zh-CN``.
> >+
> >+  The client should compare its locale identifier to members of this array.
> >+  If a match is found, the experiment is applicable.
> >+
> >+  If this property is not defined, the client should assume the experiment
> >+  is to run on all locales.
>
> What is the difference between language and locale in this example, and what
> are we using to compute the client locale? If the user installs and
> activates a language pack, will the locale be the originally installed
> locale or the language pack locale?

I saw both "language" and "locale" somewhere, so I included both.

I think active language pack / locale makes sense. But I can see a case for distinguishing original from active.

> >+probability
> >+  (optional) Decimal number indicating the sampling rate for this experiment.
> >+
> >+  This will contain a value between ``0.0`` and ``1.0``. The client should
> >+  generate a random decimal between ``0.0`` and ``1.0``. If the randomly
> >+  generated number is less than or equal to the value of this field, the
> >+  experiment is applicable.
> >+
> >+  The client must remember experiments that have been tested for
> >+  probability-based applicability and it must not reconsider experiments
> >+  for which it has previously failed to join due to this field. The
> >+  exception is if the probability value changes. In that case, the client
> >+  may re-evaluate its applicability into an experiment.
>
> Can we call this "sample" instead of probability? And we also need to
> clarify whether it is calculated before or after the other conditions
> including the jsfilter. It seems to me that after makes more sense.

Good catch. I agree filtering after all other conditions makes sense.

> We should record the sample rate and result the first time we compute it.
> Then, if the sample rate changes, we can compute the rate difference and
> apply a new randomization. So for instance:
>
> Example A:
> Start sample: 10% - we hit NO
> Later sample: 20% - we do another 10% sample
>
> Example B:
> Start sample: 10% - we hit NO
> Later sample: 5% - automatically rejected
>
> Example C:
> Start sample: 10% but we don't match the other conditions
> We don't compute anything and wait for a future world where the conditions
> do match.

Interesting idea. I'd like to revise it slightly. Instead of adjusting based on the differences between witnessed manifest values, we record the randomly generated sampling value from the first attempt. On subsequent manifest fetches, we compare this stored value against the manifest's value. I think there are fewer scenarios where this could lead to weirdness (i.e. a manifest sampling value that changes multiple times).

> >+jsfilter
> >+  (optional) JavaScript code that will be evaluated to determine experiment
> >+  applicability.
> >+
> >+  This property contains the string representation of JavaScript code that
> >+  will be evaluated in a sandboxed environment using JavaScript's
> >+  ``eval()``.
> >+
> >+  The string is expected to contain the definition of a JavaScript function
> >+  ``filter(context)``. This function receives as its argument an object
> >+  describing the client's state. The function's return value is evaluated
> >+  in an ``if`` statement to determine whether the experiment is applicable.
> >+
> >+  The purpose of this property is to allow experiments to define complex
> >+  rules and logic for evaluating experiment applicability in a manner
> >+  that is privacy concious and doesn't require the transmission of
> >+  excessive data.
> >+
> >+  If this property is not defined, the client does not consider a custom
> >+  JavaScript filter function when determining whether an experiment is
> >+  applicable.
>
> nit spelling "conscious".
>
> One of Gregg's requests is that the filter function be able to express some
> details about why it failed. I think the best way to do this is by throwing
> an exception. So for example a function might do:
>
> function filter(context) {
>   var fhr = context.getHealthData();
>
>   // reject people who can't run our DLL
>   if (fhr.geckoAppInfo.xpcomabi != "x86_64-gcc3") {
>     throw context.Rejection("Incorrect xpcomabi: " +
>                             fhr.geckoAppInfo.xpcomabi);
>   }
>
>   // reject users with conflicting extensions
>   var addons = fhr.geckoAppInfo.data.last["org.mozilla.addons.active"];
>   var inspector = addons["inspector@mozilla.org"];
>   if (inspector !== undefined && !inspector.userDisabled &&
>       !inspector.appDisabled) {
>     throw context.Rejection("DOMI active");
>   }
>
>   return true;
> }

Makes sense to me.

I was also thinking that we may want to signal result by calling a function on the context. Or, we could pass a function or two into the filter function. This would allow async stuff to happen in the filter. I imagine we'll find a need for this eventually.

> > If they get unique ids, can't we just use a map instead of an array?
>
> I like the array because it expresses a form of priority. In particular
> since we'll only be enabling one experiment at a time, we'll always check
> the first experiment before moving on to the next.

This is exactly why I made it an array. I should document that.

> > What about clients where |startTime| is still in the future? Do they need to schedule activation for |startTime| or is it enough to activate them on some periodic schedule? Same for |endTime| and |maxActiveSeconds|.
>
> Good question. I think that for now the once-a-day timer would be
> sufficient, but if we can provide more precision easily feel free to do that.

Clock skew is a hard problem. In theory, we could assume the server's clock (via the HTTP Date response header - assuming we're using HTTP) is canonical and we could compute startTime and endTime from that. But, I'm really tempted to say ignore that problem for now. We do know many clocks in the wild drift pretty crazily. But I don't think it's a P1 feature for initial launch.

The timer scheduling shouldn't be too difficult of a problem: when the manifest/applicability is evaluated, set one-time timers for each startTime and endTime. Presumably, we'll need to do this sometime shortly after application startup.

> There is one case that comes to mind which is not addressed by the spec:
> what should we do if the conditions change once an experiment has already
> been installed? We have a couple options:
>
> A. once an experiment is installed, the conditions are irrelevant
> B. honor changes to all of the conditions except the sample rate
> C. honor changes to all of the conditions including the sample rate
>
> I think for v1 (A) would be fine but (B) would be better long-term. So feel
> free to either implement B now or punt it to a followup bug.

I just realized I only addressed the client's conditions changing, not the manifest's!

I think we should eventually get to B or C. The reason is it serves as a relief valve that's not "universally disable the experiment." We may start a study, realize we've gotten enough responses from Windows, then decide we only need to keep it running for OS X and Linux. The choices there are defining a new experiment applicable only to the revised client pool or adjusting the experiment's collection parameters. I like the latter because it doesn't introduce problems such as mapping multiple experiment IDs to the same logical experiment. (But will we have to do that anyway?)
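In code, the revised sampling scheme might look roughly like this (a sketch only; `store` is a stand-in for whatever persistence we end up using, and the function name is made up):

// Sketch: `store` is a hypothetical persisted Map from experiment ID to
// the random value drawn the first time the experiment was evaluated.
function isSampleApplicable(store, experimentId, sampleRate) {
  let value = store.get(experimentId);
  if (value === undefined) {
    // First evaluation: draw once and remember the draw.
    value = Math.random();
    store.set(experimentId, value);
  }
  // Later evaluations compare the same stored draw against the manifest's
  // current rate: raising the rate (0.10 -> 0.20) admits more stored
  // values; lowering it (0.10 -> 0.05) can only reject.
  return value <= sampleRate;
}

This keeps outcomes consistent across manifest refreshes without tracking a history of rate changes.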
> It's not yet documented, but I'd like the fetch URLs to contain the client
> type, channel and version number of the client. That way, we can maintain a
> server-side mapping of client support to manifest version so we never
> release an unsupported manifest version to an incompatible client. OK,
> "never" is really "rarely" and that case should be limited to "Nightly" when
> new manifest versions are implemented.

Absolutely. But for now we'll treat it as if we failed to fetch? So no new experiments will be started, but existing experiments will continue?

> It's not like we're writing super formal specifications that need to interop
> - I think it would be fine if we defined these things but just didn't
> implement them on the client nor used them in the wild until clients
> supported them.

I don't understand why we'd put it in this doc, then. If we can add it later without changing the version number, let's leave it out for now. That way if we never need it it won't be confusing.

> I just put in a placeholder. I think either nsIXULAppInfo field makes

Let's use name then.

> I saw both "language" and "locale" somewhere, so I included both.
>
> I think active language pack / locale makes sense. But I can see a case
> for distinguishing original from active.

Let's just do locale for now, then, and make it the same as the locale we report to FHR here: http://mxr.mozilla.org/mozilla-central/source/services/healthreport/providers.jsm#344

> I'd like to revise it slightly. Instead of adjusting based on the
> differences between witnessed manifest values, we record the randomly
> generated sampling value from the first attempt.

This is a much better idea than mine!

> I was also thinking that we may want to signal result by calling a function
> on the context. Or, we could pass a function or two into the filter
> function. This would allow async stuff to happen in the filter. I imagine
> we'll find a need for this eventually.

I'm hoping this will all run on a worker anyway, so let's avoid designing an asynchronous system.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #8)
> > I was also thinking that we may want to signal result by calling a function
> > on the context. Or, we could pass a function or two into the filter
> > function. This would allow async stuff to happen in the filter. I imagine
> > we'll find a need for this eventually.
>
> I'm hoping this will all run on a worker anyway, so let's avoid designing an
> asynchronous system.

Sure. But you still can't guarantee we'll never have to call an async API somewhere, somehow. Making it async prevents an avoidable version bump later.
I incorporated much of the feedback. There are still a few ambiguous pieces. But I figured I'd post something so we have something more active to work on.
Attachment #8386528 - Flags: feedback?(benjamin)
Comment on attachment 8386528 [details] [diff] [review]
Define and document Experiments Manifests

>+xpiHash
>+  (required) String hash of the XPI that implements this experiment.
>+
>+  The value is composed of a hash identifier followed by a colon
>+  followed by the hash value. e.g.
>+  `sha1:f677428b9172e22e9911039aef03f3736e7f78a7`. `sha1` and `sha256`
>+  are the two supported hashing mechanisms. The hash value is the hex
>+  encoding of the binary hash.
>+
>+  When the client downloads the XPI for the experiment, it should compare
>+  the hash of that XPI against this value. If the hashes don't match,
>+  the client should not install the XPI.
>+
>+  Clients may also use this hash as a means of determining when an
>+  experiment's XPI has changed and should be refreshed. (TODO is this
>+  necessary or do we get this for free with the add-on manager?)

We don't get this for free unless we enable addon updating, which I would prefer to avoid. I expect we'll actually change the URL also (e.g. https://telemetry-experiments.mozilla.org/xpis/foobar-1.2.3.xpi) but using the hash as an indicator of when to re-download sounds reasonable.

>+maxSessions
>+  (optional) Integer number of application sessions this experiment should be
>+  active for.

No. We're not going to implement this now and documenting something we're not going to implement on the chance that we'll need and implement it in the future is not a good practice.

>+frozen
>+  (optional) Boolean value indicating this experiment is frozen and no
>+  longer accepting new enrollments.
>+
>+  If a client sees a tree value in this field, it should not attempt to
>+  activate an experiment.

sp nit "tree"

>+JavaScript Filter Context Objects
>+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

>+accept()
>+  This function should be called if the filter passes and the
>+  experiment is applicable to the client. This function takes
>+  no arguments. It only needs to be called once.
>+
>+reject(reason)
>+  This function should be called if the filter is rejected and the
>+  experiment is not applicable to the client.
>+
>+  This function takes a single argument - a string message describing
>+  why the filter failed. That message is used for logging and forensics
>+  purposes. It may one day be transmitted to Mozilla or the experiment
>+  server operator to help debug filter failures, so the message should
>+  not contain any sensitive or identifying information.

This really seems to be waffling between an asynchronous and synchronous execution environment. We need to make a clear decision and fix the API either one way or the other. This function should be fully synchronous, use a return value, and ditch the accept() function altogether. It's up to you whether we keep using .reject or return or throw errors for rejection.

>+2. Retention of expiration information
>+--------------------------------------
>+
>+If an experiment's expiration conditions are defined in the manifest, the
>+client should persist and honor these conditions. The client should not
>+have to rely on the manifest's updating to deactivate an experiment.

I get what you mean here but the wording is confusing. Rewrite or provide an example? Also isn't this the same as #7 and #8?

>+5. Experiment reappears in manifest
>+-----------------------------------
>+
>+If an experiment disappears and reappears in the manifest, the client
>+should evaluate the experiment as if it just appeared. In other words,
>+the fact that an experiment disappears and reappears should not be
>+relevant beyond adhering to requirements 3 and 4.

This either seems obvious or silly. Why bother mentioning this at all?

>+6. Re-evaluating applicability on manifest refresh
>+--------------------------------------------------
>+
>+When an experiment manifest is refreshed or updated, the client should
>+re-evaluate the applicability of each experiment therein provided it is
>+allowed. Re-evaluation is allowed unless an experiment explicitly denies
>+it.

What does the last bit mean? How would an experiment explicitly deny something?

>+9. Calculation of sampling-based applicability
>+----------------------------------------------
>+
>+For calculating sampling-based applicability, the client will associate
>+a random value between ``0.0`` and ``1.0`` for each observed experiment
>+ID. This random value will be generated the first time sampling
>+applicability is evaluated. This random value will be persisted and used
>+in future applicability evaluations for this experiment.
>+
>+By saving and re-using the value, the client is able to reliably and
>+consistently evaluate applicability, even if the sampling threshold
>+changes.
>+
>+Clients should retain the randomly-generated sampling value for
>+experiments that no longer appear in a manifest for a period of at least
>+10 days. The rationale is that if an experiment disappears and reappears
>+from a manifest, the client will not have multiple opportunities to
>+generate a random value that satisfies the sampling criteria.

Why not keep it for all 180 days?

>+  "jsfilter": "function filter(context) { return context.fhr_enabled; }"

fhr_enabled is not a property of context. What about an example where we don't install an experiment if the user has adblock plus enabled?

r=me with comments addressed.
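For the `sha1:`/`sha256:` value convention above, client-side verification could look roughly like the following sketch using nsICryptoHash (the function name and wiring are assumptions, not the landed code):

const {classes: Cc, interfaces: Ci} = Components;

// Manifest hash identifiers mapped to nsICryptoHash algorithm constants.
const HASH_TYPES = {
  sha1: Ci.nsICryptoHash.SHA1,
  sha256: Ci.nsICryptoHash.SHA256,
};

// xpiHash is e.g. "sha1:f677428b9172e22e9911039aef03f3736e7f78a7" and
// bytes is an array of octets for the downloaded XPI.
function verifyXPIHash(xpiHash, bytes) {
  let [type, expected] = xpiHash.split(":");
  if (!HASH_TYPES.hasOwnProperty(type)) {
    return false; // Unknown hash type: refuse to install.
  }
  let hasher = Cc["@mozilla.org/security/hash;1"]
                 .createInstance(Ci.nsICryptoHash);
  hasher.init(HASH_TYPES[type]);
  hasher.update(bytes, bytes.length);
  // finish(false) returns the raw digest; hex-encode it for comparison.
  let raw = hasher.finish(false);
  let hex = raw.split("")
               .map(c => ("0" + c.charCodeAt(0).toString(16)).slice(-2))
               .join("");
  return hex === expected.toLowerCase();
}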
Attachment #8386528 - Flags: feedback?(benjamin) → feedback+
tl;dr:
- probability model revision request
- request that the randomization IDs come to runOrNot() / filter().

P.s.: Probability. This is done in test pilot already, if you want prior art, and seems to be something that worked okay! My preferred style is this:

- randomizationID: <string>
- min: <float, 0-1>
- max: <float, 0-1>

With these behaviours:

1. check for 'telex.randomids.<key>'
2. USE, or GENERATE if undef.

Why? Probability as a proportion is not enough, for these reasons:

- We sometimes want to run an Experiment if user WAS NOT in some other experiment. A simple way to do this is to reuse a previous key.
- Arm (variation) randomization is easier this way too.

Example (see the sketch below):

newtab study A: ["newtabstudy", .10, .15]
newtab study B: ["newtabstudy", .15, .20]

HUGE increase in utility and power > Small amount of code complexity increase.
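A sketch of those behaviours in code (the pref branch name follows the comment above; everything else, including the function names, is illustrative):

const {interfaces: Ci, utils: Cu} = Components;
Cu.import("resource://gre/modules/Services.jsm");

// Steps 1/2: look up 'telex.randomids.<key>'; USE it, or GENERATE if undef.
function getRandomizationValue(key) {
  let pref = "telex.randomids." + key;
  if (Services.prefs.getPrefType(pref) == Ci.nsIPrefBranch.PREF_STRING) {
    return parseFloat(Services.prefs.getCharPref(pref));
  }
  let value = Math.random();
  Services.prefs.setCharPref(pref, String(value));
  return value;
}

// Arms become disjoint slices of one shared draw:
//   newtab study A -> inBucket("newtabstudy", 0.10, 0.15)
//   newtab study B -> inBucket("newtabstudy", 0.15, 0.20)
function inBucket(key, min, max) {
  let v = getRandomizationValue(key);
  return min <= v && v < max;
}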
More notes:

1. Great work so far on this!

2. Max Sessions <- we have never needed this.

3. Updates and signed hashes play together badly. We have deployed bad experiments in the past, and relied on addon updating to fix the mess. I also fully understand why you want to avoid it, and only do addon updates via Manifest change.

4. I lean toward "manifests for a particular experiment can only change in one way --> existing to 'kill'". I would suggest adding an explicit "kill" key that does this: If the experiment is running, STOP IT AND UNINSTALL THE ADDON NOW.

5. Agree that removing an experiment from the manifest allows the experiment to continue running, but no longer enrolls new people.

6. When in an experiment, the main reason TelEx should be getting new manifests is to look for KILL orders. Is this the same place where "UPDATE" orders should be? (I can live with that :) )

7. In past implementations, "runOrNot" has been async/promised, mainly because we have tended to view the experiment process as a 'forward only' async kind of thing. WHY: because getting the existing ADDON LIST is async, and we sometimes want to handle the 'does not have addon X' case.
GEO vs. Locale

This is a hornet's nest. Half of EN-US is outside the US. In Unicorn world, I would prefer to actually have "In US, but using *-ES" as a possibility. I realize this is challenging.

If I have to choose, I would actually choose GEO over lang.

(q: when we do GEO (in AMO download dashboard for example), is it through a 3rd party? I have assumed it was geo-ip-ish).
> 3. Updates and signed hashes play together badly. We have deployed bad
> experiments in the past, and relied on addon updating to fix the mess. I
> also fully understand why you want to avoid it, and only do addon updates
> via Manifest change.

I'm not sure what this means. We will be able to deploy new versions with very little effort and no dependency on AMO, so I think we're ok here.

> 4. I lean toward "manifests for a particular experiment can only change in
> one way --> existing to 'kill'". I would suggest adding an explicit "kill"
> key that does this: If the experiment is running, STOP IT AND UNINSTALL
> THE ADDON NOW.

We have that via the "disabled" flag.

> 6. When in an experiment, the main reason TelEx should be getting new
> manifests is to look for KILL orders. Is this the same place where "UPDATE"
> orders should be? (I can live with that :) )

Yes.

> WHY: because getting the existing ADDON LIST is async, and we sometimes
> want to handle the 'does not have addon X' case.

That is not necessary because the FHR payload has this data already. (It will be precomputed).

> This is a hornet's nest. Half of EN-US is outside the US. In Unicorn
> world, I would prefer to actually have "In US, but using *-ES" as a
> possibility. I realize this is challenging.
>
> If I have to choose, I would actually choose GEO over lang.

For v1, the translation experiment requires locale and not geo. We'd have to invent a new mechanism for geo (either geo-IP or using geolocation) so if we do need geo for your experiment let's break that out into a different bug. It's a schedule risk that you might have to implement, though!

> (q: when we do GEO (in AMO download dashboard for example), is it through a
> 3rd party? I have assumed it was geo-ip-ish).

The various log/server dashboards are based on geo-ip lookups.
> 3. Updates and signed hashes play together badly. We have deployed bad
> experiments in the past, and relied on addon updating to fix the mess. I
> also fully understand why you want to avoid it, and only do addon updates
> via Manifest change.

I'm not sure what this means. We will be able to deploy new versions with very little effort and no dependency on AMO, so I think we're ok here.

====

I am unclear on how 'deploy new versions' works. Does one update the HASH in the manifest file for an experiment? Then it's up to TelEx to notice the hash changed, and reinstall? If so, will that appear to be an 'update' to the addon?

(btw, we have never had to update more than twice for a particular experiment)
Yeah, we'd update the URL and the hash to point to a new version of the XPI: existing clients would download and install the new XPI and new clients would install the new version.
gps, did you have a chance to decide on the manifest URL format yet?
Flags: needinfo?(gps)
It just came up that it would be useful, depending on the experiment, to exclude clients with bad system clocks (if we have good options for detection). bsmedberg, would that be something sensible for v1 or more likely v2+? At least the translation experiment wouldn't require it.
Flags: needinfo?(benjamin)
Defer till later.
Flags: needinfo?(benjamin)
Attachment #8382660 - Attachment is obsolete: true
Attachment #8382660 - Flags: feedback?(glind)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #11)
> >+JavaScript Filter Context Objects
> >+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> >+accept()
> >+  This function should be called if the filter passes and the
> >+  experiment is applicable to the client. This function takes
> >+  no arguments. It only needs to be called once.
> >+
> >+reject(reason)
> >+  This function should be called if the filter is rejected and the
> >+  experiment is not applicable to the client.
> >+
> >+  This function takes a single argument - a string message describing
> >+  why the filter failed. That message is used for logging and forensics
> >+  purposes. It may one day be transmitted to Mozilla or the experiment
> >+  server operator to help debug filter failures, so the message should
> >+  not contain any sensitive or identifying information.
>
> This really seems to be waffling between an asynchronous and synchronous
> execution environment. We need to make a clear decision and fix the API
> either one way or the other. This function should be fully synchronous,
> use a return value, and ditch the accept() function altogether. It's up to
> you whether we keep using .reject or return or throw errors for rejection.

I'm not sure how this is waffling. The new API is fully asynchronous. Return values are irrelevant.

I could change this to be a promise that is resolved on applicable or rejected on inapplicable. But that's about the only change I could see being made.

> >+2. Retention of expiration information
> >+--------------------------------------
> >+
> >+If an experiment's expiration conditions are defined in the manifest, the
> >+client should persist and honor these conditions. The client should not
> >+have to rely on the manifest's updating to deactivate an experiment.
>
> I get what you mean here but the wording is confusing. Rewrite or provide an
> example? Also isn't this the same as #7 and #8?

It's really the same as #3. I'll merge them.

> >+5. Experiment reappears in manifest
> >+-----------------------------------
> >+
> >+If an experiment disappears and reappears in the manifest, the client
> >+should evaluate the experiment as if it just appeared. In other words,
> >+the fact that an experiment disappears and reappears should not be
> >+relevant beyond adhering to requirements 3 and 4.
>
> This either seems obvious or silly. Why bother mentioning this at all?

Yes, it's redundant. I wanted to document it explicitly. I can probably merge it.

> >+Clients should retain the randomly-generated sampling value for
> >+experiments that no longer appear in a manifest for a period of at least
> >+10 days. The rationale is that if an experiment disappears and reappears
> >+from a manifest, the client will not have multiple opportunities to
> >+generate a random value that satisfies the sampling criteria.
>
> Why not keep it for all 180 days?

This isn't forbidden by the spec.
(In reply to Georg Fritzsche [:gfritzsche] from comment #18)
> gps, did you have a chance to decide on the manifest URL format yet?

I haven't decided exactly what it will be. But that shouldn't matter. Assume the URL will be stored in a preference and that we'll use Services.urlFormatter.formatURLPref(PREF_NAME) to obtain the final URL. See toolkit/components/urlformatter/nsURLFormatter.js and go to about:config and filter on '%' to see how this is used in practice.
Flags: needinfo?(gps)
The jsfilter object should be fully synchronous. Please remove all references to the asynchronous API.
The URL format does matter if we're actually going to get it reviewed and landed this week...
(In reply to Gregg Lind (User Research - Test Pilot) from comment #12)
> - We sometimes want to run an Experiment if user WAS NOT in some other
> experiment. A simple way to do this is to reuse a previous key.

OK. That requirement is worth capturing. We should be able to do this via filter functions by looking into the FHR payload for data from previous experiments. Would this be sufficient? Or, do we need to track experiment activation for longer than 180 days?

> - Arm (variation) randomization is easier this way too.
>
> Example:
>
> newtab study A: ["newtabstudy", .10, .15]
> newtab study B: ["newtabstudy", .15, .20]

I want to say we could do this by having two experiments, each with 5% sampling. But something in the back of my head is telling me that the math works out to different results (essentially the first items in the list get priority over the others). Does it? Could we work around this by randomizing the order the server releases items in the manifest?

Having different "slices" of a sampling population use different XPIs is a significant change to the manifest format. Can we defer this to v2?
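(Checking that math, assuming each experiment draws its 5% sample independently and experiments are evaluated in manifest order with at most one active: P(first) = 0.05, but P(second) = 0.95 * 0.05 = 0.0475, because the second entry is only considered when the first rejects. The key-based ranges avoid this asymmetry entirely, since both arms are disjoint slices of a single shared draw.)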
(In reply to Benjamin Smedberg [:bsmedberg] from comment #24)
> The URL format does matter if we're actually going to get it reviewed and
> landed this week...

But it doesn't block Georg on implementing the URL code, which is why I answered that way.

(In reply to Benjamin Smedberg [:bsmedberg] from comment #23)
> The jsfilter object should be fully synchronous. Please remove all
> references to the asynchronous API.

I strongly disagree with this opinion.

I'm very tempted to change the FHR and Telemetry payload accessors from properties to functions. Why should the client have to populate data that isn't used? That's expensive. In the case of Telemetry, I believe there is even a certain amount of jank involved (to collect some of the memory measurements). Async allows us to just-in-time populate missing values as a perf optimization.

Plus, async is the JavaScript way. There are lots of async-driven APIs in the JavaScript language. Array.forEach being a trivial example (that won't prove my point). Can you guarantee all of them operate on the same event loop tick? Will the JS language never grow a core async API that spans multiple ticks?

Async is the right solution here.
I will think on this. Different experiments is fine, but knowing FOR SURE that you are getting a 'clean' person (or, sometimes making sure you get previously exposed people) is pretty easy if you can see the previous randomization in the 'runOrNot' filter function. 180 days is probably fine.
Addressed more feedback.
Attachment #8388823 - Flags: review?(benjamin)
Attachment #8388823 - Flags: feedback?(glind)
Attachment #8386528 - Attachment is obsolete: true
> I'm very tempted to change the FHR and Telemetry payload accessors from
> properties to functions. Why should the client have to populate data that
> isn't used? That's expensive. In the case of Telemetry, I believe there is
> even a certain amount of jank involved (to collect some of the memory
> measurements).

There is no reason we cannot use a worker and continue with a synchronous programming paradigm and still make these lazy function calls/properties. The synchronous system makes for code which is easier to read and write. This is my decision, not just an opinion.
I'd like the URL to have at least the following components:

* channel
* version
* name

Something like:

https://<hostname>/%NAME%/%CHANNEL%/%VERSION%/experiments.json

We could throw locale, OS, architecture, etc in there. But I don't think it buys us much. The primary goal of the URLs is to minimize the payload size going to clients and to give Ops-y control and insight into client behavior. If we end up varying applicability by other parameters, we can always throw them into the URL later.

I made this proposal with the assumption that we'd deploy either transparent server-side URL rewriting or we'd have dynamic code intercept the entire URL space. I'm guessing the former because of CPU constraints with the initial launch (rewrites in C via nginx, apache, load balancer, etc are faster than {Python, Perl, etc}).
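On the client side, resolving such a template could be as simple as this sketch (Services.urlFormatter.formatURLPref is the API mentioned earlier; the pref name and default value here are assumptions):

Components.utils.import("resource://gre/modules/Services.jsm");

// Hypothetical pref holding the manifest URL template, e.g.
// "https://<hostname>/%NAME%/%CHANNEL%/%VERSION%/experiments.json".
const PREF_MANIFEST_URI = "experiments.manifest.uri";

// formatURLPref reads the pref and expands the %...% variables
// (nsURLFormatter understands NAME, CHANNEL and VERSION, among others),
// so a Nightly client might end up fetching something like
// .../Firefox/nightly/31.0a1/experiments.json.
let manifestURL = Services.urlFormatter.formatURLPref(PREF_MANIFEST_URI);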
Make filter() synchronous.
Attachment #8388846 - Flags: review?(benjamin)
Attachment #8388846 - Flags: feedback?(glind)
Attachment #8388823 - Attachment is obsolete: true
Attachment #8388823 - Flags: review?(benjamin)
Attachment #8388823 - Flags: feedback?(glind)
(bikeshed risk - ignore if it creates more work!)

1. What ops vars does ops actually want in that URL? I worry a bit about optimizing it at all. Maybe this should wait until the 'reporting' infrastructure is more settled? If our logging already knows how to parse request headers to aggregate stats, then a fancy URL might not be necessary.

2. Support a <host>/experiments.json fallback? Having a static load-balanced file worked efficiently for TP1
Comment on attachment 8388846 [details] [diff] [review]
Define and document Experiments Manifests

Feedback:

- This looks great.
- `maxActiveSeconds` should be required
- `startTime`: optional? (allowing `0` as 'whenever' is nice, but gross. Are negative values allowed?)
- nit: either label all keys as 'required' or 'optional'. Some are currently labeled with neither. I am reading this as "optional, unless required".
- as written, there are a lot of complicated possible interactions between all the timing keys. I believe they are all necessary.
- I still don't like the `sample`, and would strongly prefer:
  * sample-key
  * sample-min
  * sample-max

  I have needed this in the past, and it's not an idle request. This sampling regime should also be exposed to `filter()`

Filter
==========

a. Any reason not to expose the `prefs` tree here? In the past, wrong pref settings have been a (common) reason to reject. (examples: home page, search prefs)

b. Cramming `evalable()` js into this JSON feels... not awesome. Having a good troubleshooting environment would be handy, and I would like a 'fuzzer' tool for this eventually, to ensure it returns true and false when it should.

> RE: 4. Re-evaluating applicability on manifest refresh

How does this behaviour actually work? There are so many possible interactions with all the timing options here. In previous work, I have limited this to changing the following:

- ADDON HASH: updates the addon (as in #10)
- DISABLED: causes an ADDON_UNINSTALL
- FROZEN: no longer evaluated, or sends filter: "frozen"

Following this advice would remove 5 and 6. In the past I have implemented this by recording the experiment validity determination by ID.
New request / suggestion: "information url" [required] or such, which links to any or all outside information about the experiment. It's possible to embed this in the description, but better to have it explicit.
(In reply to Gregg Lind (User Research - Test Pilot) from comment #33)
> - I still don't like the `sample`, and would strongly prefer:
>   * sample-key
>   * sample-min
>   * sample-max
>
> I have needed this in the past, and it's not an idle request.
> This sampling regime should also be exposed to `filter()`

Is that something you'd require for the planned v1 experiments or can we just tackle this in v2?
We aren't going to do sample-* for this iteration. Future iterations may do automatic bucketing of people into multiple groups, but for v1 let's do the `sample` thing that is easy to calculate.
Comment on attachment 8388846 [details] [diff] [review]
Define and document Experiments Manifests

appName shouldn't be optional, and neither should channel. Mistaken deployment on either of those would be unfortunate ;-)

I don't think that the health*Enabled/telemetryEnabled keys are necessary: experiments shouldn't run at all if either FHR or telemetry is disabled since we can't collect the data anyway.

Remove the extra refs to accept()/reject() and the sentence "The return value of the ``filter()`` call is ignored." which is no longer true.

r=me with those changes
Attachment #8388846 - Flags: review?(benjamin)
Attachment #8388846 - Flags: review+
Attachment #8388846 - Flags: feedback?(glind)
This can land now with the above fixes.
Blocks: 974009
No longer blocks: 973999
https://hg.mozilla.org/integration/fx-team/rev/ffd85374ab3f

While I was here, I hooked these docs up to the in-tree Sphinx docs. Patch was very minimal.
Status: NEW → ASSIGNED
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 31
Product: Firefox Health Report → Firefox Health Report Graveyard