Closed Bug 1301188 Opened 8 years ago Closed 7 years ago

[SHIELD] Study Validation Review for Activity Stream Validation Experiment

Categories

(Shield :: Shield Study, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: tspurway, Unassigned)

References

Details

Attachments

(1 file)

We are currently running the Activity Stream addon in Test Pilot, and are collecting engagement and retention data (https://sql.telemetry.mozilla.org/dashboard/activity-stream-executive-summary).

In order to validate this as a Firefox replacement for existing about:newtab functionality, we need to:
- run the addon on a release population 
- collect additional data on the existing about:newtab Tiles pane in order to establish an experimental control
- compare engagement and retention data across the Tiles and Activity Stream branches of the about:newtab page in the experiment

There are exactly two variants (branches) for this experiment.  One branch keeps the existing about:newtab page unchanged from the current (Tiles) experience.  The other branch offers the Activity Stream addon as the default about:newtab.  It is necessary to run the Tiles variant as part of the experiment so that we can augment the data collection for this pane.  In particular, we would need to collect a telemetry ID to uniquely identify the user in order to compute MAU/DAU.  I have no insight into whether the distribution of the population over these variants should be anything other than 50-50.
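
For concreteness, here is a minimal sketch (Python/pandas) of why the per-user telemetry ID matters for MAU/DAU: both are counts of unique clients, so raw event counts are not enough.  The column names and the trailing 28-day MAU window are illustrative assumptions, not the actual telemetry schema or measurement definition.

    import pandas as pd

    def dau_mau(events, as_of):
        """DAU = unique clients active on `as_of`; MAU = unique clients active
        in the trailing 28 days ending on `as_of` (whether MAU means a calendar
        month or a trailing 28-day window is itself a measurement choice)."""
        on_day = events["activity_date"] == as_of
        in_month = ((events["activity_date"] > as_of - pd.Timedelta(days=28))
                    & (events["activity_date"] <= as_of))
        dau = events.loc[on_day, "client_id"].nunique()
        mau = events.loc[in_month, "client_id"].nunique()
        return dau, mau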

The experiment data will be analysed to compare various engagement and retention metrics.  It would seem rational to compare clicks on 'frecent' Tiles with clicks on Activity Stream top sites, to compare searchbox usage in both variants, and to compare the distribution of about:newtab invocations for each branch.

We would also like to ask whether there is any difference in Firefox user retention between the two variants.

Our hypothesis is that the Activity Stream user experience is significantly more engaging than Tiles, and that the effect on user retention is either positive or neutral for Activity Stream users.  We are looking for guidance in framing the effect thresholds for the experiment.

The current population for Activity Stream in Test Pilot is 10,000 DAU.  I am not sure how large the sample size would need to be for the Shield experiment.

Because a MAU calculation is central to our analysis, I would think that one month would be the bare minimum running time for this experiment.

I have attached a CSV of historical data with click-through rates (CTRs) of 'frecent' tiles on the existing about:newtab page to be used as a rough baseline.  Note that, unlike the proposed experiment, these data are not normalized across unique users.

I do not think we need to run the experiment as a moco-wide experiment, as we have already run it in Test Pilot.

r? :rweiss
Attachment #8789049 - Flags: review?
Flags: needinfo?(rweiss)
There are two areas that need clarification before I can sign off:

1) We need effect size decisions (aka "success criteria") and we need corresponding power tests for those effect sizes.

2) I'm not sure MAU is the right outcome measure, and you might find a specific week's retention rate to be a more convincing outcome. 

The success metrics seem to be organized around the following:

- Number of new tabs opened (bigger is better)
- Amount of searchbox usage (bigger is better)
- Clickthrough rate (higher is better)
- Number of MAU (more is better)
- Retention rate (higher is better)

For each of these areas, the appropriate sample size can be determined with a power calculation for a given effect size on the "bigger" dimension in each of those success criteria.  For example, if we are looking for an improvement in MAU or DAU, we should describe a threshold amount ("We want to see an increase in retention by X%" / "We should see N more MAU" / etc.).  Whatever that threshold turns out to be can then be used to perform a basic power test.

However, it can be tricky to select an effect size.  The clickthrough rates look like they're around 3-5%, correct?  So what percentage increase in CTR is the minimum you're willing to accept as a "ground truth" change?  1%?  0.01%?  This is really a product manager question: do you think being able to say "we saw an increase of 0.01% in CTR" is a convincing enough story to call this a meaningful improvement?  (For what it's worth, detecting a 0.01% change would require a massive sample size by back-of-envelope calculations, whereas a 1% change is potentially more reasonable.)
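
To make the back-of-envelope concrete, here is a minimal sketch (Python, statsmodels) of that two-proportion sample-size calculation, treating the numbers above as absolute percentage-point lifts on an assumed 4% baseline CTR (an illustrative value from the 3-5% range above, not an agreed-upon baseline).

    from statsmodels.stats.proportion import proportion_effectsize
    from statsmodels.stats.power import NormalIndPower

    def users_per_arm(baseline_ctr, lift, alpha=0.05, power=0.8):
        """Users needed per branch to detect baseline_ctr -> baseline_ctr + lift."""
        h = proportion_effectsize(baseline_ctr + lift, baseline_ctr)  # Cohen's h
        return NormalIndPower().solve_power(effect_size=h, alpha=alpha,
                                            power=power, alternative="two-sided")

    print(users_per_arm(0.04, 0.01))    # +1 point of CTR: a few thousand users per arm
    print(users_per_arm(0.04, 0.0001))  # +0.01 points: tens of millions per arm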

I'm leery of the MAU calculation as an outcome, but a similar power test should be calculated there as well.  The power test for absolute counts, as opposed to rate changes, is a little different.  Again, what's the minimum threshold for MAU that we're trying to test for?  An increase of 1000?  An increase of 10000?  An increase of 100?

I suspect that an Nth-week retention outcome might be a better comparison for effect, if only because it more directly targets the "retention" desire.  For example, instead of MAU, you might find the "9th week retention rate" a more convincing retention outcome than simple MAU.  I am basing this simply on the chart in the executive summary link above, where (at the time of writing this bug) the 9th week seems to be the 50% dropoff point for the past few weeks.  For what it's worth, we can track both measurements, but we would have to run the experiment for 9-10 weeks instead of one month.

:nchapman, can you speak to the specific effect size comments above?  Specifically, to the bulleted items above?
:emtwo, when we see nchapman's endorsement, can you provide a power test for those effect sizes?  
:glind, this seems like a straightforward A/B test with various inference tests.  Can you verify that the experiment design will allow us to test the "engagement effect" of Activity Stream as defined in the previously mentioned engagement dimensions (click-through rate and so on)?
Flags: needinfo?(rweiss)
Flags: needinfo?(nchapman)
Flags: needinfo?(msamuel)
Flags: needinfo?(glind)
One last comment: we could test *each week's* retention rate if we wanted.  Again, it's a product manager question.  I suspect the 1/7/14/30 day retention rate breakpoints are probably also good tests in addition to 9th week retention rates. 

Consider the following: if we're trying to improve the early entry points, we will care more about the retention rate in the early days of usage.  It could be that the "increased engagement" effect only applies to the early hours/days of Activity Stream usage, and that we lose that increased engagement over a long enough window of time.  That doesn't mean that Activity Stream *wasn't* more engaging; it just might not be engaging enough over time.  This is mostly about establishing many targets that can be used to describe "success."
If it helps at all, we can grind the Activity Stream metrics from our Test Pilot experiment to baseline realistic expectations around 1/7/14/30 day retention for the Shield experiment.
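
As a concrete reference, here is a minimal sketch (Python/pandas) of the 1/7/14/30 day retention breakpoints discussed above, computed from per-client activity data.  The column names (client_id, enroll_date, activity_date) and the "active at or beyond day N" definition are illustrative assumptions, not the actual Test Pilot or Shield schema.

    import pandas as pd

    def day_n_retention(events, breakpoints=(1, 7, 14, 30)):
        """Fraction of enrolled clients seen again at or beyond day N after
        enrollment (whether 'day N retention' means active on exactly day N
        or at-or-beyond day N is itself a product decision)."""
        events = events.copy()
        events["days_since_enroll"] = (
            events["activity_date"] - events["enroll_date"]
        ).dt.days
        cohort_size = events["client_id"].nunique()
        return pd.Series({
            "day_%d" % n: events.loc[events["days_since_enroll"] >= n,
                                     "client_id"].nunique() / cohort_size
            for n in breakpoints
        })
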
All the things below

tl;dr - valid design as is, but it seems like we can build on it a little more.  Worries about sample size / low power.  Indirect detection of user 'feel'.

Longer:

1.  Agree that the design is an A/B with intervention / observe-only

    a.  One problem in this design *for this very visible feature* is that A/B here conflates "doing something" with "doing this specific thing".  I am tempted to suggest some sort of "sham treatment" for the A arm.  It's possible that enrolling in the study is sham treatment enough.

2.  Echoing rweiss:  I am worried about the effect size and power here as well.  I predict this feature is a 'grower', and that, if anything, there might be a slight DIP due to unfamiliarity to counteract the slight RISE due to novelty effect.

3.  Unexplored pieces / research targets
   
    a. Using this to map more release-population user stories for new tab page / tiles / stream.  

       (Using a survey, or other design tool to elicit)

    b.  Measure 'awesomeness' more directly... have some 'share this / tell your friends' engagement points.  Heartbeat-style ratings?  I think these are more likely to vary than the MAU/DAU stuff.

4.  Some specifics on outcomes:

- Number of new tabs opened (bigger is better)
  
  (I predict this is unlikely to change)

- Amount of searchbox usage (bigger is better)

  (I predict this is unlikely to change, and the test here should be one-sided... powered for 'no loss'; see the one-sided power sketch after this list)

- Clickthrough rate (higher is better)
- Number of MAU (more is better)
- Retention rate (higher is better)
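
On the one-sided "no loss" point for searchbox usage, here is a minimal sketch (Python, statsmodels) of how that could be powered as a one-sided two-sample test.  The 0.05-standard-deviation margin is an illustrative assumption, not an agreed threshold; by symmetry, the sample size needed to detect a drop of d standard deviations with a one-sided test equals the size needed to detect a rise of d, so the code powers for "larger".

    from statsmodels.stats.power import TTestIndPower

    def users_per_arm_no_loss(margin_sd, alpha=0.05, power=0.8):
        """Users per branch to detect a change of `margin_sd` standard
        deviations in searches/user/day with a one-sided alternative."""
        return TTestIndPower().solve_power(effect_size=margin_sd, alpha=alpha,
                                           power=power, alternative="larger")

    print(users_per_arm_no_loss(0.05))  # a 0.05 SD margin: roughly 5,000 per arm
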
Assignee: nchapman → nobody
Flags: needinfo?(glind)
(In reply to Rebecca Weiss from comment #1)
> There are two areas that need clarification before I can sign off:
> 
> 1) We need effect size decisions (aka "success criteria") and we need
> corresponding power tests for those effect sizes.

Here are the success criteria we have decided upon for our first shield study:

1) Increase in Topsite clicks per user per day
2) Increase in Searches per user per day
3) Increase in new tab sessions per user per day

The power analysis is here (bottom 3 graphs): https://gist.github.com/emtwo/170201de6063d052a73a8af7beae4bb8

It looks like we need roughly 8,000 users per arm based on that analysis, but I recommend going for 10,000 to have some buffer.
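
For reference, here is a minimal sketch (Python, statsmodels) of the kind of two-sample power calculation that produces a figure in that ballpark.  The standardized effect size below is an illustrative assumption chosen to land near 8,000 per arm, not the actual effect sizes from the gist.

    from statsmodels.stats.power import TTestIndPower

    # Detecting a ~0.044 standard-deviation difference in a per-user-per-day
    # metric (topsite clicks, searches, or new tab sessions) at alpha=0.05
    # with 80% power needs on the order of 8,000 users in each branch.
    n_per_arm = TTestIndPower().solve_power(effect_size=0.044, alpha=0.05,
                                            power=0.8, alternative="two-sided")
    print(round(n_per_arm))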


> 2) I'm not sure MAU is the right outcome measure, and you might find a
> specific week's retention rate to be a more convincing outcome. 

Our first study will be run for two weeks only so we will not be focusing on any retention outcomes.
Flags: needinfo?(msamuel)
Adding needinfo for Rebecca requesting sign-off.
Flags: needinfo?(rweiss)
(In reply to Marina Samuel [:emtwo] from comment #5)
> (In reply to Rebecca Weiss from comment #1)
> > There are two areas that need clarification before I can sign off:
> > 
> > 1) We need effect size decisions (aka "success criteria") and we need
> > corresponding power tests for those effect sizes.
> 
> Here are the success criteria we have decided upon for our first shield
> study:
> 
> 1) Increase in Topsite clicks per user per day
> 2) Increase in Searches per user per day
> 3) Increase in new tab sessions per user per day

After talking to Nick Chapman I thought we were shooting for "do no harm" for searches, not looking to increase search volume. If you state an increase as a success metric, we'll hold you to it ;) Nick?


> It looks like we need roughly 8000 users per arm based on that analysis but
> I recommend going for 10,000 to have some buffer.

8k-10k per arm is going to be extremely difficult. It may take 2+ months to fill that enrollment. 

> 
> 
> > 2) I'm not sure MAU is the right outcome measure, and you might find a
> > specific week's retention rate to be a more convincing outcome. 
> 
> Our first study will be run for two weeks only so we will not be focusing on
> any retention outcomes.
> After talking to Nick Chapman I thought we were shooting for "do no harm"
> for searches, not looking to increase search volume. If you state an
> increase as a success metric, we'll hold you to it ;) Nick?

I think the hope is that people use Activity Stream more than the current tiles/newtab page overall, which means that they'd end up doing more searching inside of Activity Stream.  Though we make no claim about the impact on overall search in Firefox.
I think this is ready for an initial investigation.  Signing off (and clearing my earlier NI for nchapman for tidiness).
Flags: needinfo?(rweiss)
Flags: needinfo?(nchapman)
We are going to re-run this experiment with the same variants, same design, and same expected outcome.  

r? :rweiss (do you need any additional info/data?)
Flags: needinfo?(rweiss)
r+ (no additional info needed)
Flags: needinfo?(rweiss)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED