Closed Bug 1240849 Opened 8 years ago Closed 8 years ago

Firefox Dashboard: engagement ratio

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Assigned: mreid)

References

(Blocks 1 open bug)

Details

User Story

Definition of DAU(day): calculate the total number of active users for that date using main_ping.subsessiondate == `day`. Because of data lag, this calculation will not be final until 10 days after `day`.

Definition of MAU(day): calculate the total number of active users over a 28-day window `day` - 28 < main_ping.subsessiondate <= `day`.

On the final dashboard, for each date `E`, display the following calculation:
  MEAN(DAU(d) for `E` - 7 < d <= `E`) / MAU(E)

Display this ratio on the main Firefox dashboard https://metrics.services.mozilla.com/firefox-dashboard/ with daily granularity.
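
For illustration only, a minimal Python sketch of the calculation above, assuming DAU and MAU have already been computed per these definitions (the `dau` and `mau` dicts here are hypothetical inputs keyed by date, not part of any existing pipeline):

    from datetime import date, timedelta

    def engagement_ratio(dau, mau, day):
        # Mean DAU over the 7 days ending on `day` (E - 7 < d <= E),
        # divided by the 28-day MAU ending on `day`.
        window = [day - timedelta(days=i) for i in range(7)]
        avg_dau = sum(dau[d] for d in window) / 7.0
        return avg_dau / mau[day]

    # e.g. engagement_ratio(dau, mau, date(2016, 2, 1))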

Attachments

(1 file)

14.47 KB, application/json
rvitillo: review+
Details
For 2016, one of the key metrics that we will be tracking for the year is engagement ratio. In general this is DAU/MAU (daily active users divided by monthly active users). In the user story I've included a draft of how I think we should calculate this in detail for the executive dashboard. I'm looking for feedback from rweiss and jjensen or a delegate that this is a correct and implementable user story.

The reason I'm averaging DAU over a 7-day period is that we know we have significant fluctuations in DAU on weekdays compared to weekends, and I don't think it's valuable for the executive rollup to show the noise from those weekly fluctuations. We could average it over the entire 28-day period, but that would introduce significant lag.

We should also expect, over the year, to be asked to provide the engagement ratio broken down by all kinds of subgroups, for example:

* people who installed Firefox for the first time in a particular day/week
* people using Windows 10 versus 7 versus XP
* people using addons versus no addons
* people in various funnelcakes or other experiments
* people who use various features heavily versus lightly

So in addition to providing this top-level number to the main Firefox dashboard, we should have an architecture in place that can provide segmented data relatively quickly.
Flags: needinfo?(rweiss)
Flags: needinfo?(jjensen)
Priority: -- → P3
Hi Benjamin,

1) I agree we should calculate a 7-day moving average of DAU. The raw DAU data will of course be available for ad hoc, more timely analyses, but broader reporting should use the 7-day moving average.

2) I agree we should ensure that we calculate it in such a way as to be able to do so for subpopulations, like the list you suggest.

Based on past experience with people defining measures in slightly different ways, I think it's important that we create a canonical definition of *precisely* how this is calculated. There is an assortment of small details involving which packets, timestamps, and IDs get counted that needs to be sorted out. I don't think it will be contentious at all, just some necessary definition work.

So I am going to suggest that a member of my team and yours meet to iron those out and agree on a precise piece of code to calculate DAUs and MAUs generally and for individual subpopulations. That code and/or resulting datasets will be *the* place where everyone should look.

Similarly, and while this is a separate issue, we should do some analytical work to, as best as possible, understand the likely patterns of this ratio. What seasonality can we expect? What variance? What levels of change should induce excitement or disappointment? What are some preliminary indicators of meaningful covariates? Etc. My team can dig into it this quarter.
Flags: needinfo?(jjensen)
Hi Benjamin,

As discussed, I've asked Dave Zeber to work with you or someone on your team to prepare some code or pseudocode as canonical documentation for calculating DAU and MAU for both general and sub-populations.
Dave, I'd like feedback on the user story field, which I hope has the necessary rigor to be the specification. When do you expect to be able to provide feedback? This is a fairly urgent project, given that it's the basis for most of our 2016 metrics.
Flags: needinfo?(dzeber)
Blocks: 1198541
bsmedberg, one concern I have right now with this formulation is that the denominator doesn't seem to be rolling as well.  Am I wrong about this?

For example, consider the case where E = February 1st.  The numerator will be computed from the average over the last week of January plus Feb 1st, but the MAU count will be for February?  Maybe it should be MAU(E) - 1 month?
Flags: needinfo?(rweiss)
Added needinfo for comment above from bsmedberg and dzeber.
Flags: needinfo?(benjamin)
I'll inject something to start with here:

Engagement ratio = DAU'/MAU'

Where:

DAU' = moving average of the DAU over the previous 7 days
MAU' = moving average of the MAU over the previous 30 days

DAU = Daily Active Users: total count of unique client IDs in the population of interest with at least one subsession of activity during the UTC day
MAU = Monthly Active Users: total count of the unique client IDs in the population of interest with at least one subsession of activity over the previous 30 UTC days

"Population of interest":
- For example, "all profiles", "all Nightly channel", "all Windows profiles", etc.

Reporting:
- The Engagement Ratio will be calculated daily.
- The Engagement Ratio on the last day of each month will be recorded as the Engagement Ratio for that month.
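
For concreteness, a rough sketch of this formulation, assuming hypothetical pandas Series `dau` and `mau` holding the daily counts; this is illustrative, not the production implementation:

    import pandas as pd

    # Placeholder inputs: daily DAU and trailing-30-day MAU counts,
    # assumed to be computed elsewhere from the pings.
    idx = pd.date_range("2016-01-01", periods=60)
    dau = pd.Series(1000, index=idx)
    mau = pd.Series(5000, index=idx)

    dau_prime = dau.rolling(window=7).mean()    # DAU': 7-day moving average
    mau_prime = mau.rolling(window=30).mean()   # MAU': 30-day moving average
    engagement_ratio = dau_prime / mau_prime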

Dave -- comments?
The definition of the metric given in the user story looks fine to me. 

One point is that we want to maintain a separation between the definition of the metric as a function of the inputs (DAU, MAU) and the actual computations of DAU and MAU, so that it remains applicable across multiple products - and I think this has been accomplished in the discussion above.

To reiterate what has been laid out:

At a high level:
- DAU(day) = # active users on a single day
- MAU(day) = # users active on at least one day in a 28-day period leading up to the given day
- ER(day) = mean(DAU over 7 days leading up to given day) / MAU(day)

I agree with the smoothing in the numerator and rolling in the denominator. Smoothing over 7 days removes the relatively large day-of-week effect we see in our activity numbers while remaining sensitive enough to small-scale changes.

To implement for a specific product we need a definition of "active on a given day" that will plug into the metric computation. I agree with John that this should be a piece of code, say an isActiveOnDay() function that is implemented for each product we want to track.

For Desktop, this should probably center around main_ping.subsessiondate == `day` as mentioned, but it would be good to do a little validation to make sure any edge cases are either handled or ignored with the expectation that they will "average out". I'll take a look at this in the next couple of days.
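
As a starting point, a naive Python sketch of what such a function might look like for Desktop; the field path and the date handling are assumptions pending the validation mentioned above:

    def is_active_on_day(ping, day):
        # `ping` is a parsed "main" ping (nested dicts); `day` is a
        # 'YYYY-MM-DD' string. Edge cases (missing or malformed dates)
        # are deliberately ignored in this sketch.
        info = ping.get("payload", {}).get("info", {})
        subsession_date = info.get("subsessionStartDate") or ""
        # subsessionStartDate may carry a time component; compare dates only.
        return subsession_date[:10] == day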
Flags: needinfo?(dzeber)
Flags: needinfo?(benjamin)
Thomas, the backend parts of this are specified and ready for high-priority engineering. Should be part of the next iteration.
Flags: needinfo?(thuelbert)
we'll pack it in the sprint starting feb 8, thanks!
Flags: needinfo?(thuelbert)
I like what I see in comment 6 from jjensen except for:

"- The Engagement Ratio on the last day of each month will be recorded as the Engagement Ratio for that month."

What if we made the engagement ratio of any given month the average of the engagement ratios on all of the days of that month instead of just the last day? Yes, the number is a moving average, but the numerator covers the last 7 days, which would weight the ratio such that the last week of the month could influence the monthly engagement ratio more than other weeks.

Most months don't have major weekly fluctuation, except November and December, which have historically low activity levels in their last week.

Maybe something like:

MonthEngagementRatio[Month] = average(EngagementRatioOnDay[Month][1:28])
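
For concreteness, that averaging in Python, assuming a hypothetical er_by_day dict of daily ratios (capping the range at day 28, as in the [1:28] above, would be a one-line change):

    import calendar
    from datetime import date

    def month_engagement_ratio(er_by_day, year, month):
        # Average the daily engagement ratios over every day of the month.
        n_days = calendar.monthrange(year, month)[1]
        days = [date(year, month, d) for d in range(1, n_days + 1)]
        return sum(er_by_day[d] for d in days) / float(len(days))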
This seems like a good idea.  I would go a step further and say that we should not refer to a month's engagement ratio as "Month X's engagement ratio", but rather the "average engagement ratio for Month X" using the formulation :cmore has put forward.  The mean for the month also has the added benefit of being a statistic that can actually be used to compare month-to-month differences via inference tests, since we can compute the variance for the month as well. 

:dzeber, care to comment?

(In reply to Chris More [:cmore] from comment #10)
> I like what I see in comment 6 from jjensen except for:
> 
> "- The Engagement Ratio on the last day of each month will be recorded as
> the Engagement Ratio for that month."
> 
> What if we made the engagement ratio of any given month the average of
> the engagement ratios on all of the days of that month instead of just
> the last day? Yes, the number is a moving average, but the numerator
> covers the last 7 days, which would weight the ratio such that the last
> week of the month could influence the monthly engagement ratio more than
> other weeks.
> 
> Most months don't have major weekly fluctuation, except November and
> December, which have historically low activity levels in their last week.
> 
> Maybe something like:
> 
> MonthEngagementRatio[Month] = average(EngagementRatioOnDay[Month][1:28])
Flags: needinfo?(dzeber)
+1 :rweiss. I agree on making sure we understand and call it the average engagement rate when we are reporting the monthly numbers. If not, we could easily be using "engagement ratio" for different purposes and really confuse people. 

As long as we can measure the engagement rate each day and it is a rolling average, it can be aggregated by month, week, quarter, between a range of arbitrary dates, etc. For the reporting side of the number, it may be helpful to just write user stories (as a user, I want to do something, so that it provides benefit to me) for the different people that would consume some actionable insights from the ratio.
Assignee: nobody → mreid
Priority: P3 → P2
(In reply to Chris More [:cmore] from comment #12)
> +1 :rweiss. I agree on making sure we understand and call it the average
> engagement rate when we are reporting the monthly numbers. If not, we
> could easily be using "engagement ratio" for different purposes and really
> confuse people. 
> 
> As long as we can measure the engagement rate each day and it is a rolling
> average, it can be aggregated by month, week, quarter, between a range of
> arbitrary dates, etc. For the reporting side of the number, it may be
> helpful to just write user stories (as a user, I want to do something, so
> that it provides benefit to me) for the different people that would consume
> some actionable insights from the ratio.

+1 from me too.

We've defined the Engagement Ratio as a daily metric, and we should make sure it is always presented as such. In particular, it's not officially defined for other time periods, but we extend it to cover other time periods by averaging over the time period. This is similar to how we've been thinking about ADI. It doesn't make sense to talk about a "monthly ADI", but we can compare "average ADI over the month" between months.

I think when I first read that sentence ("- The Engagement Ratio on the last day of each month will be recorded as the Engagement Ratio for that month.") I was thinking about MAU rather than Engagement Ratio. In this proposal we are defining MAU as covering a month-long window, but starting on any given day. In that case, it probably makes sense to think of the "MAU for a calendar month" as the MAU(day) computed on the last day in the month.
Flags: needinfo?(dzeber)
Hey, just commenting here on behalf of mobile since unified telemetry landed on Android in FX44. I just want to make sure we don't do this exercise twice and that we think of mobile too in this planning.

Another good reason to combine the two:

We need to steal a page from how Facebook reports on their DAU:MAU ratio. We need to look at it not only on a per platform basis but on a per user basis (yes, I know not everyone has a Firefox account). In the grand scheme of things, we want our users to use Firefox daily regardless of the device type they use. This will ultimately allow us to know how engaged they are with our product "ecosystem".

If we just look at device trends alone (ceteris paribus), desktop DAU:MAU will likely go down and mobile DAU:MAU should naturally go up. Looking at the ratio per user would probably reveal that it's flat.
I think the "overall" MAU and DAU calculations are clear based on this bug.

I have a question about the calculations for subpopulations. Do we expect a clientId to be represented in exactly one subpopulation? Or do we expect a clientId to be counted in each subpopulation in which it appeared during the DAU/MAU window?

For example, suppose I want to partition the population by update channel. If a clientId appears in today's data on both the "release" and "aurora" channels, should the DAU look like this:
2016-02-11 | release | 1
2016-02-11 |  aurora | 1

Or should we select a single subpopulation and assign this client to that population, for something like this:
2016-02-11 | release | 1

If we choose the former, we cannot simply add up the subpopulation counts to combine them (we'd have to re-compute the counts for any combination of populations).  If we choose the latter, what criteria should we use to select the definitive facets for a given clientId?
(In reply to Mark Reid [:mreid] from comment #15)
> I have a question about the calculations for subpopulations. Do we expect a
> clientId to be represented in exactly one subpopulation? Or do we expect a
> clientId to be counted in each subpopulation in which it appeared during the
> DAU/MAU window?

I think we should count each client in exactly one subpopulation on any given day. If we don't, the summary numbers will be more difficult to compare and interpret. So the example would be counted as:

> 2016-02-11 | release | 1

> If we choose the latter, what criteria should we use to select
> the definitive facets for a given clientId?

I recommend we just use the most recent value observed for that client's field on the given day. We have been using this logic with FHR for things like version/build updates that can have multiple values per day. Implicit is the assumption that the error caused by this logic is negligible across our large population, and in fact this something that we can now verify from the data.
(In reply to Dave Zeber [:dzeber] from comment #16)
> (In reply to Mark Reid [:mreid] from comment #15)
> > I have a question about the calculations for subpopulations. Do we expect a
> > clientId to be represented in exactly one subpopulation? Or do we expect a
> > clientId to be counted in each subpopulation in which it appeared during the
> > DAU/MAU window?
> 
> I think we should count each client in exactly one subpopulation on any
> given day.
What about for the 28-day MAU window?
Good point - we need a solution that makes sense for any time period.

The central issue is that we want to split clients into segments, but a single client can belong to multiple segments in a given time period. We have two main options for counting, each with different implications.

Option (1): Consider each client as belonging to exactly one segment for the entire time period.
- Total AU == sum of AU across segments
- Need to decide how to choose "the" segment when there are multiple segments
- Only need to compute AU for the most fine-grained segments. AU for higher-level segments is the sum of all sub-segments. For example, if we know AU across all combinations of (channel, distribution, geo), then AU for the release channel is the sum of AU over all the segments with channel == "release".
- For computation, selecting a single value when there are several may be more complex - e.g. it may require finding the most recent ping for each client over each time period.
- It is reasonable to consider the user as belonging to a single segment for the entire day, but it is probably not reasonable to assign the user to a single segment for an entire month.

Option (2): Count a client repeatedly for each segment in which it is active over the time period.
- Total AU != sum of AU across segments (since we could be double-counting)
- Need to compute AU separately for every segment and combination of segments we consider. It doesn't make sense to add AU across segments. For example, if we know AU across all combinations of (channel, distribution, geo), AU for the release channel still needs to be computed separately, since it is not the sum of AU over all the segments with channel == "release".
- Computation may be simplified, since we just need to run through pings and count segment occurrences. 
- Reasonable for both DAU and MAU.
- May be more difficult to interpret, since it is not additive across segments.

For MAU I think option (2) is the best, since it allows for the fact that users change between segments across the month.
For DAU I think either option works. Option (1) might be easier to work with, although option (2) would be consistent with MAU.

For example, consider the time right after a new version is released.
- We'd expect the engagement ratio for the old version, ER(d; old version), to decrease as people update.
- The numerator, DAU(d; old version) counts people currently on the old version.
- The denominator, MAU(d; old version) counts people who have been on the old version at some point during the past month - this should include both those still on the old version and those who have updated to the new version.

Summary:
- Recommend (2) for MAU - count the client for each subpopulation in which it appears during the month.
- Recommend (1) for DAU - count the client in exactly 1 subpopulation on any given day. However, I think (2) is a valid alternative if others favour it, so long as we understand the caveats around additivity.
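
To make the additivity trade-off concrete, a toy pandas sketch (the column names are made up, and "counter" stands in for whatever ordering field we settle on):

    import pandas as pd

    # One row per ping: client "a" shows up on two channels this period.
    pings = pd.DataFrame({
        "client_id": ["a", "a", "b"],
        "channel":   ["release", "aurora", "release"],
        "counter":   [1, 2, 1],
    })

    # Option (1): assign each client to one segment via its most recent ping.
    latest = pings.sort_values("counter").groupby("client_id").tail(1)
    option1 = latest.groupby("channel")["client_id"].nunique()
    # aurora: 1, release: 1 -> additive (sums to the total AU of 2)

    # Option (2): count a client in every segment where it was active.
    option2 = pings.groupby("channel")["client_id"].nunique()
    # aurora: 1, release: 2 -> client "a" is double-counted, not additive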

Thoughts?
I'd like to summarize once more the current consensus on our definitions for computing activity over segments.

- Segment: a combination of values across a set of UT fields (analogous to a WHERE clause in SQL). For example: (application.channel = "release" AND environment.os.name = "Windows_NT" AND environment.profile.creationDate >= 16801). 

- Active in a segment on a date (type 1): a client was active in a segment on a date d if, out of all "main" pings from that client for which payload.info.subsessionStartDate == d, the most recent subsession matches the segment conditions. The most recent subsession is the one whose info.profileSubsessionCounter is the highest out of all pings on date d.

- Active in a segment on a date (type 2): a client was active in a segment on a date d if there is at least one "main" ping from that client matching the segment conditions for which payload.info.subsessionStartDate == d.

  > The definitions of "active" rely on the fact that we expect to see at least one subsession for each calendar day on which the client was active.
  > Note that the date listed in the payload is in the client's local time - and this is the desired behaviour.

- DAU(d; segment): the # of distinct clients who were active (type 1) in the segment on date d.

- MAU(d; segment): the # of distinct clients who were active (type 2) in the segment on at least one date d' in the time period d-28 < d' <= d.

- ER(d; segment): mean(DAU(d'; segment) over d-7 < d' <= d) / MAU(d; segment)

- MAU(calendar month; segment): MAU(last day in the calendar month; segment).

  > Alternatively, we could take the average MAU over all 28-day periods that fit into the calendar month. For example, if the month has 31 days, we could use MAU(yyyy-mm) = mean(MAU(yyyy-mm-28), MAU(yyyy-mm-29), MAU(yyyy-mm-30), MAU(yyyy-mm-31)).

- ER(time period longer than a day; segment) = mean(ER(d'; segment) over d' in the time period). 

  > In other words, ER is only defined for a single day. To compare ER over periods like weeks or months, we need to average over the period. This is the same as we have been doing with ADI.

The distinction between type 1 and type 2 active is as discussed in Comment 18. An alternative is to use type 2 active for both DAU and MAU.
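
To pin down the type 1 / type 2 distinction in code, a rough Python sketch of the per-client checks, assuming parsed pings as nested dicts with the field paths named above; matches_segment is a caller-supplied predicate encoding the segment conditions:

    def _pings_on_day(client_pings, day):
        # All "main" pings for one client whose subsession started on `day`.
        return [p for p in client_pings
                if p["payload"]["info"]["subsessionStartDate"][:10] == day]

    def active_type1(client_pings, day, matches_segment):
        # Type 1: the most recent subsession on `day` (highest
        # profileSubsessionCounter) must match the segment.
        on_day = _pings_on_day(client_pings, day)
        if not on_day:
            return False
        latest = max(on_day,
                     key=lambda p: p["payload"]["info"]["profileSubsessionCounter"])
        return matches_segment(latest)

    def active_type2(client_pings, day, matches_segment):
        # Type 2: any subsession on `day` matching the segment suffices.
        return any(matches_segment(p) for p in _pings_on_day(client_pings, day))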
(In reply to Dave Zeber [:dzeber] from comment #19)
> I'd like to summarize once more the current consensus on our definitions for
> computing activity over segments.
> 
> - Active in a segment on a date (type 1): a client was active in a segment
> on a date d if, out of all "main" pings from that client for which
> payload.info.subsessionStartDate == d, the most recent subsession matches
> the segment conditions. The most recent subsession is the one whose
> info.profileSubsessionCounter is the highest out of all pings on date d.
> 
> - Active in a segment on a date (type 2): a client was active in a segment
> on a date d if there is at least one "main" ping from that client matching
> the segment conditions for which payload.info.subsessionStartDate == d.

These definitions should also include one more item per the user story: the pings we are considering for "active on date d" must fall within d <= submissionDate <= d + 10 days, where submissionDate is the server-assigned timestamp on which we received the ping from the client. 

This puts an upper bound on how long we wait for activity counts to become final (yay!).  It also means that we will be slightly under-counting over time, as we will miss activity that arrives more than 10 days after it happens, as well as data from clients whose clocks are consistently incorrect by >10 days.
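
In code, this extra condition is just a window check on the two dates; a minimal sketch, assuming 'YYYY-MM-DD' strings for both:

    from datetime import datetime, timedelta

    def within_submission_window(activity_day, submission_date, max_lag_days=10):
        # Count a ping toward activity on `activity_day` only if the server
        # received it within `max_lag_days` days of that date.
        d = datetime.strptime(activity_day, "%Y-%m-%d")
        s = datetime.strptime(submission_date, "%Y-%m-%d")
        return d <= s <= d + timedelta(days=max_lag_days)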
(In reply to Mark Reid [:mreid] from comment #20)
> These definitions should also include one more item per the user story: the
> pings we are considering for "active on date d" must fall within d <=
> submissionDate <= d + 10 days, where submissionDate is the server-assigned
> timestamp on which we received the ping from the client. 

I understood the user story to mean that we only trust the results when submissionDate >= d+10, but not that we should explicitly ignore late-arriving pings.

Should we be ignoring pings that arrive more than 10 days after the activity? For example, if I run a job to count activity in some specific segment in the first week of Jan 2016, would I exclude pings submitted after Jan 17?
Flags: needinfo?(benjamin)
For operational sanity, we do need to set a cutoff after which we stop caring about old activity. For now I've been setting that at 10 days, but we could do 14. As long as it's consistent it shouldn't matter.
Flags: needinfo?(benjamin)
Rebecca, what do you think about the type 1 vs type 2 approach as described in Comment 19?
Flags: needinfo?(rweiss)
Minor update: Hamilton and I have agreed on a data format and file naming convention for the "overall" DAU/MAU data, so I will begin with that, then move on to the subpopulations once we've agreed on an approach that works for everyone.
:bsmedberg: Quick question. Are you thinking that, if we agree upon a specific definition of the engagement ratio, we should think about it across the Firefox platform, including desktop and mobile? A lot of the conversation above feels desktop-centric and I understand the reasons why, but it would be nice to unify this for both desktop and mobile. Once all of mobile is on UT, this seems more possible technically, but for now, at a high level, having a standard cross-platform definition for reporting would be helpful.
Flags: needinfo?(benjamin)
All we should really need to make this cross-platform is a definition of "Active on a given date", and possibly segmenting, on each platform. If we can compute "# of distinct active users on a given date" or "in a given date range" on another platform, we can drop these into the definitions of DAU/MAU/ER as listed in Comment 19.
Yes, the goal is to have a general definition of engagement ratio and then a more specific definition for each product in terms of its particular data.
Flags: needinfo?(benjamin)
I spoke to :rweiss offline and the conclusion is to use Type 1 for DAU and Type 2 for MAU. It makes sense to select a single data point within a *day* for a given clientId, since it's relatively unlikely for anything interesting to change for that clientId, but within a period of a whole month (or 28 days as we've defined it above), it is fairly likely for some interesting dimension to change (say, Firefox Version), and we should count those populations independently.

The upshot is that I'll need to pre-compute each subpopulation we're interested in up front during the aggregation job, and that those subpopulation counts will not be additive, so they can't simply be summed to produce counts for combinations of subpopulations.
Flags: needinfo?(rweiss)
Points: --- → 3
Priority: P2 → P1
Depends on: 1253392
Attached file MauDau.ipynb
Here is the notebook for computing the overall MAU and DAU as defined in this bug. It also relays the resulting data to the Firefox Dashboard.
Attachment #8732131 - Flags: review?(rvitillo)
Attachment #8732131 - Attachment mime type: text/plain → application/json
The overall engagement ratio is complete (pending review of the attached notebook). I've filed bug 1257511 for capturing the definitions in this bug as documentation.

I would like to handle the querying of subpopulations using the HyperLogLog approach described in bug 1253644, which should let us count unique clientIds for arbitrary combinations of dimensions in a performant way.
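
For illustration, Spark's built-in approx_count_distinct is backed by a HyperLogLog++ sketch and shows the flavor of this approach; the input path and column names below are hypothetical, and the actual bug 1253644 work presumably persists the sketches themselves so they can be merged across dimension combinations:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("client-count").getOrCreate()

    # Hypothetical input: one row per ping, with these columns extracted.
    pings = spark.read.parquet("s3://example-bucket/main-summary")

    # HLL-based estimate of unique clientIds per dimension combination.
    counts = (pings
              .groupBy("submission_date", "channel", "os")
              .agg(F.approx_count_distinct("client_id").alias("client_count")))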
Here's a Gist link to the notebook for easier viewing:
https://gist.github.com/mreid-moz/8b2c2b1c6594d658ca5e
(In reply to Mark Reid [:mreid] from comment #30)
> I would like to handle the querying of subpopulations using the HyperLogLog
> approach described in bug 1253644, which should let us count unique
> clientIds for arbitrary combinations of dimensions in a performant way.

Any particular reason why we can't handle the whole population with the approach described in Bug 1253644?
We can, that's what I want to do.
Oh, you mean instead of this notebook.  We can potentially do that too, especially now that we have this data to validate against.
Attachment #8732131 - Flags: review?(rvitillo) → review+
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
For posterity, the HLL based dashboard can be viewed at https://sql.telemetry.mozilla.org/dashboard/firefox-client-count.
Product: Cloud Services → Cloud Services Graveyard