Closed Bug 863440 Opened 8 years ago Closed 8 years ago

Proposal for FHR document payload format v.Next

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dre, Assigned: rnewman)

References

Details

Attachments

(2 files)

This is a proposal for a substantial change to the FHR schema to support better tracking of data when the payload version changes or when some factor of application info such as build number, memory, etc. changes.

Currently, a change in the schema causes a problem with how to report the mixture of old and new data.  The problem is especially visible on the day a change occurs because some sessions for the day were recorded with the old schema and some with the new schema.  If we throw out the old one or attempt to merge them, we lose important data or we risk corrupting the new data with the old leading to potentially incorrect analysis.

A similar problem exists when the user upgrades the application or makes other changes to the environment such as installing new memory, upgrading or disabling an add-on, etc.  Without longitudinal tracking of the app info data, any analysis that focuses on certain factors such as build number becomes unreliable when looking at the data prior to the current session that submitted.

A proposal for a way to solve both of these issues and have a schema that is very future-proof is laid out below.

Synopsis:
Whenever a session starts, the data for the session is keyed with a hash of all the app info variables.  Whenever an app info variable changes, the hash and the variables that changed (i.e. the diff) is stored to be part of the submission payload.  Each date in the dates section of the payload contains one or more objects whose key is the hash value and whose value is the collection of session information pertaining to that set of app info values.

Workflow:

* On app / session startup:
** Retrieve all current app info values.
** Compute a hash for the set.
** Key the new session data collection with this hash.
** Check whether this hash and value set has been previously persisted. If not, persist it.

* On app / session start after an abnormal termination:
** Persist and finalize the previous session data into storage under the given hash key with the session end flag of "abnormal termination".

* On app / session close:
** Persist and finalize session data into storage under the given hash key with the session end flag of "normal termination".

* On dynamic app info change (i.e. change of restartless add-on or enable/disable of plug-in):
** Finalize the current session with the time of the change and record the session end flag of "app info change".
** Start a new session with a startup time of the change and a session start flag of "app info change".

* On document construction for submission:

** Construct the dates object of the payload.
*** Add a day object with the date as the property name for every day we have session info to record.
**** Remember the app info hash keys that were used for each date in a set of all observed hash keys.
**** Write each hash key as a property name under the date object with all of the session data for that app info hash as the value.

** Construct the appInfo object of the payload.
*** Add an object to the appInfo object with a property name of "current" and a value of all app info values for the current session.
** Add each observed hash key as an additional property to the appInfo object with the hash key as the property name and the value being only the app info variables that have different values from the current app info.  Note that this means we will have a property with the name being the current hash key and the value being an empty object.

* On removal of out of date entries:
** Remove out of date entries as normal.
** Scan remaining dates and remember the observed hash keys used in those dates.
** Remove any app info records that do not exist in the observed set of hash keys.

For the purposes of this document, we should consider "appInfo" to include all info about the environment (appInfo + sysInfo) to cover things like memory size and operating system version.
Attachment #739263 - Flags: feedback?(mconnor)
Attachment #739263 - Flags: feedback?(gps)
Comment on attachment 739263 [details]
Example JSON document following proposed format

Initial reaction is good and I think most of what you propose is doable. I think we should have all the engineers get on a call together to flush out some things - as I think it will take a lot of back and forth. Perhaps dumping this on an etherpad [and allowing inline comment] is a good first step?
Attachment #739263 - Flags: feedback?(gps) → feedback+
I think a hack / discussion session would be great.

I'm also attaching a WIP version of a translator that Mark did.  It doesn't translate to precisely the format described in the proposal yet (it leaves a lot of data still in the days segments), but it would be a good starting point for further experimentation.
I was going to suggest the 'hash' doesn't really need to be a hash and could instead by any consistent identifier (say an integer). However, I'm guessing you want the identifier to be globally consistent to make server-side aggregation easier? If that's an objective, perhaps we should limit the hash to more common properties (build ID, platform, etc) and not include add-ons?
We really need a Data Format component.
I'd settle for a Client Shared component.
When we were working on the proposal, hash meant a true stable hash function. Since then, however, I discovered there aren't any good clean is hash implementations that seemed ripe for use, and there are separate challenges and concerns about making them stable across documents. If anyone wants to dig into those, I am happy to do so, but what it comes down to is that I would be happy to have a simple identifier that is only guaranteed to be stable for a single profile, or even a single document.
The major value I see here is reduction of duplicate data, clarity of links between measurements and potentially correlated factors, and straightforward versioned data lineage. A true hash or a long term stable function are not hard requirements for any of those.
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #0)

Thanks for the thorough elucidation!

Some questions, as I'm implementing storage on Android right now.



> * On app / session startup:
> ** Retrieve all current app info values.

Note that this is fine if you limit the definition of "app info" to "app info values that are available natively to Java" (e.g., channel, build ID). Two reasons:

* Gecko sessions are very frequent in Android, and should really be ignored. Gecko can start and stop several times during a browsing session, depending on your definition of "session".
* Gecko data is not available when document generation occurs.


> * On app / session start after an abnormal termination:

Again, this depends on how you define "abnormal termination". I'm not sure if there's a sane definition of abnormal termination on Android. If we're lucky, we'll be able to tell if the previous activity lifecycle ended in a crash, by omission, but that's something of a heuristic (how do we know that the activity isn't still running? need to wait for it to start again…), and in general we're at Android's mercy here.


> * On dynamic app info change (i.e. change of restartless add-on or
> enable/disable of plug-in):

Add-on data might not be feasible as part of app info on Android; depends on whether it's managed by Gecko or Java. Why do you want this transition? You elsewhere mention "appinfo + sysinfo", and addons aren't part of that.
(In reply to Richard Newman [:rnewman] from comment #7)
> (In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #0)
> Some questions, as I'm implementing storage on Android right now.
> 
> > * On app / session startup:
> > ** Retrieve all current app info values.
> 
> Note that this is fine if you limit the definition of "app info" to "app
> info values that are available natively to Java" (e.g., channel, build ID).

Seems reasonable to me that the list of attributes that are packaged up as an "environment" might be different for Android.

> Two reasons:
> 
> * Gecko sessions are very frequent in Android, and should really be ignored.
> Gecko can start and stop several times during a browsing session, depending
> on your definition of "session".
> * Gecko data is not available when document generation occurs.
> 
> 
> > * On app / session start after an abnormal termination:
> 
> Again, this depends on how you define "abnormal termination". I'm not sure
> if there's a sane definition of abnormal termination on Android. If we're
> lucky, we'll be able to tell if the previous activity lifecycle ended in a
> crash, by omission, but that's something of a heuristic (how do we know that
> the activity isn't still running? need to wait for it to start again…), and
> in general we're at Android's mercy here.
> 

I think it makes sense to tailor the session definitions to the Android world and not try overly hard to tie them to the Gecko sessions seen in desktop.

One of the most important reasons for evaluating session at all is to be able to provide a denominator for calculations of stability (# crashes / session duration) and such.  As long as we can track suitable information to make those calculations, we'll be fine.

> 
> > * On dynamic app info change (i.e. change of restartless add-on or
> > enable/disable of plug-in):
> 
> Add-on data might not be feasible as part of app info on Android; depends on
> whether it's managed by Gecko or Java. Why do you want this transition? You
> elsewhere mention "appinfo + sysinfo", and addons aren't part of that.

Add-on configuration change (i.e. installation, dis/enabling or update) are also part of the environment and if the add-ons are likely to have an impact on performance and stability, it is important to try to track them as they change over time.

This need has to be balanced by what we can reasonably collect from the Android platform.  So far, I haven't seen restartless add-ons, so it likely wouldn't be as much of an issue.  Maybe we could just make sure that the Gecko part that manages the add-ons writes the notice of a change in a place that the Android data collection can see and re-act?  Happy to hear ideas here.
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #8)

> I think it makes sense to tailor the session definitions to the Android
> world and not try overly hard to tie them to the Gecko sessions seen in
> desktop.

OK, so your goal here is simply to identify/bundle changes across major environmental variables, where the definition of 'major' is platform-specific (but presumably broad enough to address Bug 851544 by ensuring that each multi-version day includes multiple session hashes).

Sounds good to me.

> This need has to be balanced by what we can reasonably collect from the
> Android platform.  So far, I haven't seen restartless add-ons, so it likely
> wouldn't be as much of an issue.

There are restartless add-ons. (Even my JNI ones that call into Java for testing are restartless, which is pretty awesome.)

Recording at runtime is actually less of a concern for me. I just want to make sure that we don't need access to Gecko data (e.g., add-ons) at any point that isn't driven by Gecko -- i.e., at document collection time, when Gecko probably isn't running.

> Maybe we could just make sure that the
> Gecko part that manages the add-ons writes the notice of a change in a place
> that the Android data collection can see and re-act?  Happy to hear ideas
> here.

My only visceral objection to that is that it introduces yet another location for add-ons state, and we already know that add-ons is our least reliable data source (due to add-ons manager complexity). The possibility of the Java FHR-side record of add-ons diverging from Fennec/Gecko's worries me -- even if we periodically blew away and refreshed our cache, that would introduce false transitions and reduce the value of our conclusions.

Of course, we'll almost certainly have to bite this bullet in order to report add-ons at all, so :/


While I'm writing: are you happy with "_HASH_1 is an empty object" being enough of an indicator for "this is the current state"? I don't see any kind of sequential record, direct object reference, etc., and I wonder if that's intentional.
(In reply to Richard Newman [:rnewman] from comment #9)

> My only visceral objection to that is that it introduces yet another
> location for add-ons state, and we already know that add-ons is our least
> reliable data source (due to add-ons manager complexity). The possibility of
> the Java FHR-side record of add-ons diverging from Fennec/Gecko's worries me
> -- even if we periodically blew away and refreshed our cache, that would
> introduce false transitions and reduce the value of our conclusions.
> 
> Of course, we'll almost certainly have to bite this bullet in order to
> report add-ons at all, so :/

I'm open to suggestions on a reasonable middle path here. Agree it is a difficult problem.



> While I'm writing: are you happy with "_HASH_1 is an empty object" being
> enough of an indicator for "this is the current state"? I don't see any kind
> of sequential record, direct object reference, etc., and I wonder if that's
> intentional.

I am fine with the empty object being the current one. As I was going through the logic, I convinced myself that there couldn't be more than one empty object and that the empty one would always be the current config.  If I overlooked a possible way that might not be a true indicator, or if we just want to be cautious, I'd also be fine with the current env object having the hash name specifically indicated in it.

Also, just to make it clear, I'm completely open to suggestions on what exactly the identifiers are. "_HASH_#" is ~okay.  Simple autoincs are fine too.  We could probably get a little fancier and use the build id + autoinc to have a key that would pack a little bit of information into it as well.
Before we dive too far into much-needed bike shedding, I think it's important to address questions over changes to the longitudinal recording of data incurred by this change. If I'm understanding the proposal correctly, each "hash" change will be accompanied by a delta from the previous hash. This effectively constitutes longitudinal recording of things we don't yet record longitudinally (e.g. add-on installs). Furthermore, we have also decreased the granularity of reporting from 1 day to sub-day. This seems like a decrease in privacy and thus I believe it warrants scrutiny.

Is this per-hash delta required? Must we store a delta of every field that went into the hash? If not, what's in and what's out?

On the technical side, I assume fields inside the per-hash delta can be excluded from longitudinal recording (for they would be redundant)?
I do agree it wouldn't hurt to have privacy review, but I am not sure if it makes more sense to review before, after, or during our sorting outstanding questions about the proposal.

In the currently collected add-on data, there are fields for installation and upgrade date.  That is 95% of the longitudinal information we would be collecting in this proposed format.  The only thing that is new there is the ability to properly associate metrics with the add-on configuration if it changes multiple times.

As far as changing information to be tracked longitudinally, the biggest thing is the relocation of the platform information such as CPU, Memory, and OS version.

As for sub-day granularity, note that we will only have multiple sections for a day if there was some significant configuration change.  This is important to evaluate for a couple of reasons:
1. Being able to record the information from multiple provider schemas
2. Changes in performance or stability are most likely correlated with some configuration change, and if we lump all the metrics collected for a day together, we conflate the metrics for one configuration with the ones for the other. That leaves us with the poor choice of having to throw out all of the data for that day.
(In reply to Gregory Szorc [:gps] from comment #11)
> On the technical side, I assume fields inside the per-hash delta can be
> excluded from longitudinal recording (for they would be redundant)?

The data structure under the "days" section will contain just the environment identifier (the hash) and the longitudinal measurements such as counts and timings.  None of the data that appears in the environment section would be repeated.
Blocks: 865238
Blocks: 865254
FTR, in the course of implementing this on Android I'm electing to include a pointer to the current hash from 'current'.

Question:

In the sample payload, the 'diff' between each environment is ambiguous. See:

    "_HASH_3": {
      "version": "21.0a1"
    }

does that version refer to:

        "version": "22.0a1",

in geckoAppInfo, or

        "version": "3.5.0-26-generic"

in sysInfo? The diffs need to be structured to match.
Another thought: achieving environment deltas could get expensive (at least in the mobile context). In order to tell you that an add-on changed version, or the user enabled telemetry, we need to store the complete data for every environment in the past 180 days.

(It's every environment, not just the last, because we need to be able to compare each of them to the current; if you switch from envA:v20 to envB:v21 and back to envC:v20, we should not report the version as changed in the envA delta, even though it changed in each step.)

For a user on beta, assuming 5 betas per release, and one release every six weeks, that's 21 environments to track. Each time an add-on changes, add another (or two, if they later stop using it!).

There are 22 fields in this set (over six months, 462 cells), plus the total set of active add-ons.

I'm gonna start building it, but this raised my eyebrows.
(In reply to Richard Newman [:rnewman] from comment #14)
> FTR, in the course of implementing this on Android I'm electing to include a
> pointer to the current hash from 'current'.
> 
That is fine, what does the pointer look like?


> Question:
> 
> In the sample payload, the 'diff' between each environment is ambiguous. See:
> 
>     "_HASH_3": {
>       "version": "21.0a1"
>     }
> 
> does that version refer to:
> 
>         "version": "22.0a1",
> 
> in geckoAppInfo, or
> 
>         "version": "3.5.0-26-generic"
> 
> in sysInfo? The diffs need to be structured to match.

Yes, I agree the diff should not be flattened, it should match the structure of the env objects to avoid ambiguity.  

Have you looked at this JSON diffing library? It may not be applicable in this precise instance, but the methodology is very nice.
http://benjamine.github.io/JsonDiffPatch/demo/index.htm
(In reply to Richard Newman [:rnewman] from comment #15)
> Another thought: achieving environment deltas could get expensive (at least
> in the mobile context). In order to tell you that an add-on changed version,
> or the user enabled telemetry, we need to store the complete data for every
> environment in the past 180 days.
> 
> (It's every environment, not just the last, because we need to be able to
> compare each of them to the current; if you switch from envA:v20 to envB:v21
> and back to envC:v20, we should not report the version as changed in the
> envA delta, even though it changed in each step.)
> 
> For a user on beta, assuming 5 betas per release, and one release every six
> weeks, that's 21 environments to track. Each time an add-on changes, add
> another (or two, if they later stop using it!).
> 
> There are 22 fields in this set (over six months, 462 cells), plus the total
> set of active add-ons.
> 
> I'm gonna start building it, but this raised my eyebrows.

I agree it is a worthwhile concern.  It would help a lot if we can finish the transformation script we wrote that would turn a current desktop format into something resembling this new format so we could explore the edge cases (and the normal cases).

One solution I think would definitely mitigate this problem is a true VCS model like git or hg that applies a series of changes rather than always comparing the current to each version.  It might be possible to do something like that with the model used in JsonDiffPatch, but I wasn't sure we could come up with a suitable implementation or whether it would be overly complex, and since it wasn't well laid out, I didn't put it in the proposal.
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #16)

> That is fine, what does the pointer look like?

Simply

  current: {
    hash: "abcdef…",    // SHA-1
    org.mozilla.profile.age: {
      ...
    },
    ...
  }

It would also make me happy to do

  {
    lastPingDate: ...
    currentEnv: sha1
    environments: {
      sha1: { ... }
      othersha1: { ... }
    }
    ...
  }

> Yes, I agree the diff should not be flattened, it should match the structure
> of the env objects to avoid ambiguity.  

Done.

> Have you looked at this JSON diffing library? It may not be applicable in
> this precise instance, but the methodology is very nice.
> http://benjamine.github.io/JsonDiffPatch/demo/index.htm

We store the environments in the DB, and populate a Java object, so diffing for us is actually just comparison of object fields. Not too bad, considering that it's Java!
Two more things.

1. Each day's data will use the full hash, not "current". That's to keep the payload stable. Use environments->current->hash to get the hash itself.

2. Sessions.

My proposal is:

* Record a 'begin' event with a unique session ID on activity start/resume.
* Record an 'end' event with the same ID on activity pause.
* Reconcile these in document generation.

A crash or improper shutdown is signified by two consecutive 'begin's. Sessions can (infrequently but safely) cross date boundaries.

Note that sessions correspond to Android behavior, not to Gecko usage. We are assuming that the content provider will stay alive between sessions (and of course writes will occur on background threads), and thus DB overhead would be minimal.

Resulting data might look like:

  "2013-01-01": {
    "abcdef": {
      "org.mozilla.gecko.sessions": [
        {"duration": 1234,
         "init": 123,
         "sessionRestore": 45,
         "end": "pause"},
        {"maxDuration": 123,
         "init": 45,
         "end": "abnormal"}]}}

In order to provide session time for abnormal termination, we'd need to record ticks.

How do you folks feel about this?

Urgent response requested.
Flags: needinfo?(deinspanjer)
Related: FHR document generation is unlikely to occur during browser usage. Do you need "last", (perhaps except inside about:healthreport?), given that we'll be reporting fine-grained sessions?

If you don't, then we are much closer to done.
The session recording and hash generation you describe sound fine to me.

Outside of health report, I don't think we need the last session.
Flags: needinfo?(deinspanjer)
For the web folks: here's the starting point for how these docs will look.

Blank document, as generated on first run:

  http://rnewman.pastebin.mozilla.org/2462775

After installing an add-on (now two environments):

  http://rnewman.pastebin.mozilla.org/2462776

Ones with data coming shortly.
And with search data for two environments:

http://rnewman.pastebin.mozilla.org/2462941
Current sessions format:

http://rnewman.pastebin.mozilla.org/2485446

Durations in msec. Reasons are "pause", "destroy", "orphan", corresponding to "the user left the browser", "the OS terminated the browser without calling onPause (should never happen)", and "a session was begun but never ended, either due to crash or upgrade".

Thoughts?
Flags: needinfo?(deinspanjer)
Scratch the remark about "destroy", I'm ripping that out and trusting that onPause is called. Saves a prefs write.
I'm happy with the reasons pause and orphan.  For the desktop product, session duration is measured in "ticks" which are approximately 5 seconds each.

If mobile has no similar mechanism, I think it might still be more useful to submit in seconds rather than millis since millis will lead to huge numbers whose precision has little to no real value.  Session duration is a measure of engagement and interaction with the product, and will usually be rolled up into buckets of minutes and hours.  We might occasionally perform a study that looks at short sessions to measure them in seconds, for instance, to look at the number of seconds before a "startup crash".  Even in that case, I don't think we'd care about the millis.
Flags: needinfo?(deinspanjer)
Next action item for this is to write up the actual format as landed, and dump it in docs.services. I'll take that.
Assignee: nobody → rnewman
Status: NEW → ASSIGNED
Duplicate of this bug: 889533
This will be up to date when the next automated push occurs (~30m):

https://docs.services.mozilla.com/healthreport/index.html

Reopen if you need more docs than this + the built-in FHR glossary can provide!
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Attachment #739263 - Flags: feedback?(mconnor)
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.