Closed Bug 1493595 Opened 6 years ago Closed 6 years ago

Traffic Cop Pinning to Win10 taskbar A/B testing

Categories

(www.mozilla.org :: Analytics, enhancement)

Production
x86_64
Windows 10
enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RT, Assigned: craigcook, NeedInfo)

Attachments

(1 file)

Context: We're investigating the impact that placing a Firefox shortcut on the Windows 10 taskbar would have on engagement and retention. The hypothesis is that users find it more convenient to open Firefox from the taskbar, in a context where most users do not know how to pin shortcuts to the taskbar themselves.

This bug is for the Optimizely set-up on the download page to support the test.

Start date: TBC
Channel: Release 63
Feature document: https://docs.google.com/document/d/1zhj1GWtZoUTCsAlJ2i0hdgxkAFdpK4lT6QNygRtIgCI/edit
Addressable population: New Windows 10 users on 64 bit OS with en-US locale (test and control cohorts)
Depends on: 1493597
Hi Jon, I'm reaching out since you helped us with the Win64 funnelcake (bug 1309847) and I'm unsure who should be triaging these traffic cop funnelcake requests.

We'd like to ship this funnelcake and end up with 150k profiles across both test and control funnelcakes, and we need help with the traffic cop config:
  • Enrollment of users over a 7 day period (random sampling) to control for weekly seasonality (New Windows 10 users on 64 bit OS with en-US locale, test and control cohorts).
  • We need to end up with at least 150k profiles across both test and control. Do you have a rough idea of how many downloads are needed to achieve this?
Flags: needinfo?(jon)
Hey Romain - I'm not too versed in planning the funnelcakes - just implementing their release through Traffic Cop experiments. I'm cc'ing Eric Renaud, the PM for the websites durable team. He should know who can help get this process started.
Flags: needinfo?(jon) → needinfo?(erenaud)
Thanks Jon. Eric, can you please also clarify what the timelines are typically so we can build that into our plan?
Hi Romain - per email we need to pull in Daniel Z to help with finalizing the experiment.
Flags: needinfo?(erenaud) → needinfo?(dzielaski)
Please help me understand the marketing analytics that are available for this experiment versus the data science team? Thank you!
Flags: needinfo?(rweiss)
(In reply to Marissa Morris from comment #5)
> Please help me understand the marketing analytics that are available for
> this experiment versus the data science team? Thank you!

For info, Su (on Rebecca's team) is helping us through bug 1506648. He helped us define cohort sizes and the acquisition period, whereas this bug is about exposing the funnelcake downloads through the bouncer.

Eric and Daniel, per our e-mail exchange, this ask is about defining, based on the 7 day acquisition period (advised by Su on https://docs.google.com/document/d/1zhj1GWtZoUTCsAlJ2i0hdgxkAFdpK4lT6QNygRtIgCI/edit#), which percentage of those New Windows 10 users on 64 bit OS with en-US locale we'll need to route to the funnelcakes to reach 150k profiles for each cohort. So we need at least 21.4k users acquired daily over 7 days. Based on current en-US download rates and the install/download ratios you see on this locale, can you please help confirm the percentage of en-US downloads we need to route to the funnelcake during the 7 day acquisition period? Alex would then configure traffic cop accordingly.
I'll defer to Daniel on the cohort sizing for this experiment. Please note: we're currently in a very busy time with major campaigns shipping. I can't guarantee I'll have time to get to this before all-hands.
Thank you for the update on your capacity Alex. If at all possible, it would be great if we could get some traction on this as it's the only viable thing we have going to close the 3 million user aDAU gap at the moment. If not, what's our next best target? Thank you.
I'm sure that someone on our team will have capacity to pick this up right after all-hands.
Hi team,

Sounds cool and exciting. Marketing analytics is happy to dive in and support.

I met w/ Eric R and David T this AM to chat through the details and I think we have an OK handle on the null and alternative hypotheses. We might want to invest in adding a little rigor to the hypotheses. Maybe something like:

NULL (framed in the negative, this is what we're trying to refute)
Placing a Firefox shortcut on the Windows 10 taskbar will not create a statistically significant (p<=0.05) and experimentally important positive difference in 28-day cohort retention (>=XX%) and number of active days within a 28d period (>=XX active days).

I've confirmed that we'll be able to identify the clients that receive the treatment in telemetry, and we'll then be able to evaluate the number of active days they create over some fixed period (maybe a month?).

As far as sample size, I've heard the 150k client number, which frankly seems pretty large. Is there any science/math behind that figure or is it just a conservative estimate?

My recommendation for sample size would be about one order of magnitude smaller, or 16,587 clients per group (not downloads, but clients showing up in telemetry). This estimate is based on the assumption that the active day distribution is relatively normal, we can tolerate very little uncertainty (99% confidence level), and we can tolerate very little margin for error (1%). If this group feels like any of these assumptions are flawed we can chat and change the required sample size. Or, if we feel like we need to stratify the sample in any way, we'll need to increase the complexity of my method.

Xuan Luo (Data Scientist) and Edward Cho (Data Engineer) on the Marketing Analytics team will help to evaluate the hypotheses after the experimental window closes. I'll support w/ the write up.

DZ
Flags: needinfo?(xluo)
Flags: needinfo?(echo)
Flags: needinfo?(dzielaski)
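
For reference, the 16,587 figure in Daniel's comment is consistent with the textbook sample-size formula for estimating a proportion, n = z^2 * p * (1 - p) / E^2, at a 99% confidence level, a 1% margin of error, and the most conservative p = 0.5. The thread does not say which formula was actually used, so the sketch below is an assumption that merely reproduces the number. The function name and the p = 0.5 default are illustrative, not taken from the bug.

```python
from statistics import NormalDist


def sample_size_for_margin(confidence: float, margin: float, p: float = 0.5) -> float:
    """Sample size needed to estimate a proportion within +/- margin.

    Assumption: the classic normal-approximation formula; p = 0.5 is the
    worst case (largest variance), which gives the most conservative size.
    """
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z ** 2 * p * (1 - p) / margin ** 2


n_per_group = sample_size_for_margin(confidence=0.99, margin=0.01)
print(f"clients per group: {n_per_group:.0f}")
```

Note that this answers a different question from a power calculation: it sizes a sample to estimate one quantity precisely, not to detect a given difference between two branches.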
(In reply to Daniel Zielaski from comment #10)
> Hi team,
>
> Sounds cool and exciting. Marketing analytics is happy to dive in and
> support.
>
> I met w/ Eric R and David T this AM to chat through the details and I think
> we have an OK handle on the null and alternative hypotheses. We might want to
> invest in adding a little rigor to the hypotheses. Maybe something like:
>
> NULL (framed in the negative, this is what we're trying to refute)
> Placing a Firefox shortcut on the Windows 10 taskbar will not create a
> statistically significant (p<=0.05) and experimentally important positive
> difference in 28-day cohort retention (>=XX%) and number of active days
> within a 28d period (>=XX active days).
>
> I've confirmed that we'll be able to identify the clients that receive the
> treatment in telemetry and we'll then be able to evaluate the number of
> active days they create over some fixed period (maybe a month?).
>
> As far as sample size, I've heard the 150k client number - which frankly
> seems pretty large. Is there any science/math behind that figure or is it
> just a conservative estimate?
>
> My recommendation for sample size would be about one order of magnitude
> smaller, or 16,587 clients per group (not downloads, but clients showing up
> in telemetry). This estimate is based on the assumption that the active day
> distribution is relatively normal, we can tolerate very little uncertainty
> (99% confidence level), and we can tolerate very little margin for error
> (1%). If this group feels like any of these assumptions are flawed we can
> chat and change the required sample size. Or, if we feel like we need to
> stratify the sample in any way we'll need to increase the complexity of my
> method.
>
> Xuan Luo (Data Scientist) and Edward Cho (Data Engineer) on the Marketing
> Analytics team will help to evaluate the hypotheses after the experimental
> window closes. I'll support w/ the write up.
>
> DZ

Thanks Daniel.
Su helped size the cohorts (150k profiles across both cohorts) based on our detection requirements (detect 1% change in retention or 5% change in engagement) - Su can you please help clarify the logic behind the 150k number?
Flags: needinfo?(shong)
Hi Daniel,

The sample size estimation is based on:

  • estimated 0.31 baseline retention rate (https://sql.telemetry.mozilla.org/queries/60054/source)
  • we want to detect a 0.01 change between the branches
  • assuming alpha of 0.05

Using a 2 sample difference in proportion test, we have these sizing estimations:

        Ns  retention_power
 0   10000             0.19
 1   20000             0.34
 2   30000             0.47
 3   40000             0.58
 4   50000             0.68
 5   60000             0.75
 6   70000             0.82
 7   80000             0.86
 8   90000             0.90
 9  100000             0.93
10  110000             0.95
11  120000             0.96
12  130000             0.97
13  140000             0.98
14  150000             0.99
15  160000             0.99
16  170000             0.99
17  180000             1.00

The 150,000 total profiles estimate is based on 99% power.

Let me know if you have any questions.

- Su
Flags: needinfo?(shong)
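
A minimal sketch of how a power table like Su's can be approximated, assuming a two-sided two-sample z-test for proportions with unpooled variance, Ns read as the total across both branches (split evenly), and the alternative taken as 0.31 + 0.01. None of these details are confirmed in the thread, so the output tracks the table only to within about a percentage point; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_power(total_n: int, p_base: float = 0.31,
                         delta: float = 0.01, alpha: float = 0.05) -> float:
    """Approximate power to detect p_base vs p_base + delta.

    Assumptions (not stated in the bug): equal branch sizes, unpooled
    variance, two-sided test, normal approximation.
    """
    n = total_n / 2                      # profiles per branch
    p_alt = p_base + delta
    se = sqrt(p_base * (1 - p_base) / n + p_alt * (1 - p_alt) / n)
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    z_effect = delta / se
    # Probability of landing in either rejection region under the alternative.
    return std_normal.cdf(z_effect - z_crit) + std_normal.cdf(-z_effect - z_crit)


for total in (10_000, 50_000, 100_000, 150_000):
    print(total, round(two_proportion_power(total), 2))
```

Under these assumptions, power at 150,000 total profiles comes out at roughly 0.99, in line with the figure Su quotes.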
Hi team, The above seems logical given the focus on retention. Xuan and Edward are checking to see if/when we'll expect to have enough clients based on Su's recommendation. DZ
Just seeking clarification of the bedrock dev task here: According to bug 1493597, the funnelcake IDs are 138 and 139, where one is the taskbar test and one is the control. Is this correct? And with the two builds does that imply a simple 50/50 split of traffic? (Understanding that we're already narrowing the pool to Windows 10 users of the en-US locale). At that rate, what's the projection for how long it should take to reach the 150k target number, at which point we would disable the experiment?
Assignee: nobody → craigcook.bugz
(In reply to Craig Cook (:craigcook) from comment #14)
> According to bug 1493597, the funnelcake IDs are 138 and 139, where one is
> the taskbar test and one is the control. Is this correct?

I can speak to this part - 138 is the control, 139 has the taskbar pinning.
(In reply to Daniel Zielaski from comment #13)
> Hi team,
>
> The above seems logical given the focus on retention.
>
> Xuan and Edward are checking to see if/when we'll expect to have enough
> clients based on Su's recommendation.
>
> DZ

Daniel, did you get confirmation from Xuan and Edward? Alex is back from PTO on the 17th and I'm hoping we can ship this next week.
Xuan, Su, and I are meeting today to compare notes and reach a final duration for running the experiment.
Flags: needinfo?(echo)
Edward, please set a needinfo here for Craig Cook when the duration has been calculated. We also need to know if this is a straight 50/50 of 100% of the qualifying traffic (New Windows 10 users on 64 bit OS with en-US locale) or if the cohort needs to be a percentage of that based on the original ask (comment1) in which Romain stated the requirement for the test to run for 7 days.
Flags: needinfo?(echo)
:erenaud, I got the download numbers from Xuan and Edward (clearing their needinfos).

I've calculated the download sampling requirements here: https://docs.google.com/document/d/1Wzz7oiVpV4ZCC6ZWY8R_nPyjLruNQrFmqUSp0s2MTCw/edit

We will need, for Win10, en-US downloads[1]:

over a 1 week period:
  • 18% of download traffic to funnelcake 138
  • 18% of download traffic to funnelcake 139

OR, if we spread out the enrollment over a 2 week period[2]:
  • 9% of download traffic to funnelcake 138
  • 9% of download traffic to funnelcake 139

:erenaud, can we set up a meeting with whoever will be turning on this funnelcake experiment/handling traffic cop for us? I'd like to get some clarity on how that process works and make sure we're on the same page.

- Su

[1]: I'm not including 64bit OS (which is a requirement) because, according to my conversations with Xuan and Edward, we cannot identify / target 64bit OS users / downloads from our website. If this is indeed the case, we will be serving 64bit Firefox to some users who are on a 32bit OS (the funnelcakes we're serving are 64bit). I've calculated that should be 3-4% of downloads, just wanted to call that out here.

[2]: I'm not sure if there are any issues/restrictions with sampling/serving funnelcakes to such a large portion of our downloads.
Flags: needinfo?(xluo)
Flags: needinfo?(erenaud)
Flags: needinfo?(echo)
Considering the holidays, I would recommend running the test for two weeks at the lower sample rate, just so we don't have to worry about switching off the experiment during the week between Christmas and New Year when most people aren't working. We would start it running this week and turn it off the first week of January.
(In reply to Su-Young Hong from comment #19)
> [1]: I'm not including 64bit OS (which is a requirement) because according
> to my conversations with Xuan and Edward, we cannot identify / target 64bit
> OS users / downloads from our website.
>
> If this is indeed the case, we will be serving some users who are on 32bit
> OS 64bit Firefox (the funnelcakes we're serving are 64bit). I've calculated
> that should be 3-4% of downloads, just wanted to call that out here.

We actually can detect 64bit, and thus exclude 32bit from the test. Do we need to adjust the percentages for that?
(In reply to Craig Cook (:craigcook) from comment #21)
> (In reply to Su-Young Hong from comment #19)
>
> > [1]: I'm not including 64bit OS (which is a requirement) because according
> > to my conversations with Xuan and Edward, we cannot identify / target 64bit
> > OS users / downloads from our website.
> >
> > If this is indeed the case, we will be serving some users who are on 32bit
> > OS 64bit Firefox (the funnelcakes we're serving are 64bit). I've calculated
> > that should be 3-4% of downloads, just wanted to call that out here.
>
> We actually can detect 64bit, and thus exclude 32bit from the test. Do we
> need to adjust the percentages for that?

That's good! I don't think we need to adjust the percentages: if we can target 64bit OS, then the population we are sampling from will be smaller, so it should work itself out.

Great, I'm okay with a 2 week sampling period. Can we start this week? How about Dec 18th (tomorrow) through Dec 31st?
Flags: needinfo?(erenaud)
Plan is to flip the switch to turn on the test on Thursday Dec 20th and run through til the full two weeks are up and we have sufficient numbers - or extend if we don't, per direction from Su-Young at that time.
Flags: needinfo?(shong)
Thanks Eric. Yes, that is correct.

The plan is to turn on the switch Thurs Dec 20th[1][2] and run for 2 weeks (turning off on Jan 2nd). I will monitor enrollment, so if we are not getting enough users we can extend afterwards if needed (I will raise the issue if that is the case).

Please use the 2 week enrollment numbers:
  • 9% of traffic to funnelcake 138
  • 9% of traffic to funnelcake 139

[1] date chosen so that someone would be available to turn off the experiment on Jan 2nd
[2] it will be over the holidays, but we're spreading it out over 2 weeks to mitigate
Flags: needinfo?(shong)
This experiment is now active in production: https://www.mozilla.org/en-US/firefox/new/
I know most people are still on holiday but I wanted to raise this issue now so we're aware of it when we return.

There seem to be issues with enrollment:

DAU from profiles with distribution_id = 'mozilla138' or 'mozilla139' - https://sql.telemetry.mozilla.org/queries/60716/source

Telemetry is only reporting 1 profile with the distribution IDs for funnelcakes 138 and 139 (mozilla138 and mozilla139).

The questions I think we need to start digging into when we return are:
  • Are the funnelcakes getting served/downloaded? (check in GA)
  • Are the funnelcakes installing properly?
  • Are profiles from the funnelcake installations tagging distribution_id properly?
Depends on: 1517138
Something is broken here: in addition to comment 26, bug 1517138 was just reported.
I have disabled this experiment until we can uncover the source of the breakage here via https://github.com/mozmeao/www-config/commit/7e26e7ff984387020d5fce2a59f411e522d3a40d
Ok, it looks like we're targeting 64bit users in the Traffic Cop config here: https://github.com/mozilla/bedrock/blob/master/media/js/firefox/new/experiment-win10-taskbar-funnelcake.js#L13

But I think this is falling down in 2 places:

1. We're specifying `win` in addition to `win64` for `FUNNELCAKE_PLATFORMS` in the config https://github.com/mozmeao/www-config/blob/master/waffle_configs/bedrock-prod.env#L12-L15
2. Mozorg doesn't expose 64bit builds in download buttons by default, since (iiuc) the stub installer handles serving the correct binary to Windows users.

The end result is that 64bit users on Windows 10 get a 32bit build of the funnelcake, which doesn't exist.

I'll liaise with Craig on this, but it sounds like we should probably fix 1. and then work around 2. by serving the correct build to 64bit users.
PR to set funnelcake to win64 only: https://github.com/mozmeao/www-config/pull/178
Bedrock PR to serve the 64bit build to users who enter into the experiment: https://github.com/mozilla/bedrock/pull/6664
Hey Alex,

Let's pause re-activating the experiment for now (until after we meet to discuss next steps).

- Su
Flags: needinfo?(craigcook.bugz)
Flags: needinfo?(agibson)
(In reply to Su-Young Hong from comment #34)
> hey Alex,
>
> Lets pause re-activating the experiment for now (until after we meet to
> discuss next steps).
>
> - Su

Understood, thanks.
Flags: needinfo?(craigcook.bugz)
Flags: needinfo?(agibson)
(In reply to Su-Young Hong from comment #34)
> hey Alex,
>
> Lets pause re-activating the experiment for now (until after we meet to
> discuss next steps).
>
> - Su

Hi Su. Is there a conversation or a meeting to discuss next steps, or what is specifically gating re-activating the experiment after agibson fixed the 404 issue?

Thanks!
Flags: needinfo?(shong)
(In reply to Chris More [:cmore] from comment #36)
> (In reply to Su-Young Hong from comment #34)
> > hey Alex,
> >
> > Lets pause re-activating the experiment for now (until after we meet to
> > discuss next steps).
> >
> > - Su
>
> Hi Su. Is there a conversation or a meeting to discuss next steps or what is
> specifically gating re-activating the experiment after agibson fixed the 404
> issue?
>
> Thanks!

We met with the team yesterday and, per https://bugzilla.mozilla.org/show_bug.cgi?id=1493597#c40, I'm checking with Nick about creating a 32 bit build that would fix the issue we encountered (inability to reliably detect 64 bit OS). This will, however, also require another round of QA.
Ah, got it. Thanks for the update! We had to do something like this before for the same reason, so as not to create a 404 condition for builds for specific audiences.
Flags: needinfo?(shong)

For info we're moving forward with a stub-based solution, please see details on https://bugzilla.mozilla.org/show_bug.cgi?id=1493597#c43
Su, can you please validate this looks fine from an analytics standpoint?

Flags: needinfo?(shong)

The custom stub installers work in Bug 1493597 is now ready for testing. I spoke to craigcook on slack about re-enabling the Traffic Cop in staging, and also re-enabling it for win32 requests. Plan is to do that on Friday ('zilla std time).

:rt,

Yes, that solution looks great to me. From an analysis standpoint, this works perfectly.

Follow up question, will this custom stubinstaller be sending install pings like normal stubinstallers do? And if so, where? That data will be useful for the experiment. (I'll also ask in the funnelcake bug).

:craigcook, just to confirm, so we'll be targeting:

  • Windows 10
  • en-US

users on the website, sampling 9% for 138 stubinstaller, 9% for 139 stubinstaller, and then the stubinstaller will determine:

  • if 32 bit OS, install regular Firefox
  • if 64 bit OS, install 13X funnelcake

is that a correct understanding?

Thanks!

- Su
Flags: needinfo?(shong) → needinfo?(craigcook.bugz)

(In reply to Su-Young Hong from comment #41)

:craigcook, just to confirm, so we'll be targeting:

  • Windows 10
  • en-US

users on the website, sampling 9% for 138 stubinstaller, 9% for 139 stubinstaller, and then the stubinstaller will determine:

  • if 32 bit OS, install regular Firefox
  • if 64 bit OS, install 13X funnelcake

is that a correct understanding?

Correct.

However, I now realize we'll need to make a minor change in our traffic cop setup since we were previously detecting win64 and now we need to remove that condition. I can make that update and we can get it into production first thing Monday. Then we can reactivate the experiment on stage to test and verify, and if all is well we can also activate it in production on Monday.

Flags: needinfo?(craigcook.bugz)
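
The flow Su and Craig confirm above can be sketched roughly as follows. This is an illustration of the described behaviour only, not Traffic Cop's or the stub installer's actual code: the function names, the string values for platform, and the `roll` parameter (a uniform random number in [0, 1)) are all hypothetical.

```python
import random
from typing import Optional

# Sample rates per funnelcake ID, as agreed in this bug:
# 138 is the control, 139 has the taskbar pinning.
SAMPLE_RATES = {138: 0.09, 139: 0.09}


def choose_funnelcake(platform: str, locale: str, roll: float) -> Optional[int]:
    """Website side: decide whether a visitor is enrolled, and in which branch."""
    if platform != "windows10" or locale != "en-US":
        return None  # outside the addressable population
    threshold = 0.0
    for funnelcake_id, rate in SAMPLE_RATES.items():
        threshold += rate
        if roll < threshold:
            return funnelcake_id
    return None  # the remaining 82% get the regular download


def stub_install_target(os_bits: int, funnelcake_id: Optional[int]) -> str:
    """Stub installer side: only 64 bit systems receive the funnelcake build."""
    if funnelcake_id is None or os_bits == 32:
        return "regular Firefox"
    return f"funnelcake {funnelcake_id}"


branch = choose_funnelcake("windows10", "en-US", random.random())
print(stub_install_target(64, branch))
```

Moving the 32/64 bit decision into the stub installer is what lets the website drop its own win64 detection.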

thanks Craig!

that works for me.

:rt, does that work for you?

thank you guys for moving forward on a solution for this!

(In reply to Su-Young Hong from comment #43)

thanks Craig!

that works for me.

:rt, does that work for you?
Good with me! Let's closely watch user acquisition numbers as this rolls out, to validate early on that acquisition is fine.

Craig re-enabled traffic cop on stage www.mozilla.org and I've done some testing. Everything seems to be working fine (see below) and I'm happy for us to re-enable on prod too. Is there any reason not to go ahead at this point ?

Test method:

  • 64 or 32 bit Windows 10 VMs with matching processors, Google Chrome
  • repeat:
    • open a new Incognito window (to avoid traffic cop cookie)
    • load https://www.allizom.org/en-US/firefox/new/
    • if the URL now has a ?f=138 or ?f=139 suffix then
      • click on the green 'Download Firefox' button
      • run the stub installer
      • verify the install using Help > About Firefox
        • on 64 bit it'll mention Funnelcake and is a 64 bit build; 139 pins the Firefox icon on the taskbar
        • on 32 bit it won't mention Funnelcake and is a 32 bit build

The 9% each selection seems to be working, broadly speaking, as the additional query arg is uncommon, and roughly the same frequency. Couldn't get Windows 7 to give me a funnelcake in ~40 attempts.

I could not get a URL with a ?f=138 or ?f=139 suffix when loading https://www.allizom.org/en-US/firefox/new/ in an incognito Window and pressing refresh about 50 times, both on Chrome and Firefox - could it be because I'm in France? It should still be working given we target en-US regardless of geography but maybe traffic cop got disabled on stage?

I also re-ran Su's query (https://sql.telemetry.mozilla.org/queries/60716/source), which shows a few more DAU of the funnelcake as of yesterday.

I'm good with enabling on prod, assuming the reason I cannot test is that traffic cop got disabled or that I did something wrong when testing.

You also need to ensure Do Not Track is OFF. We don't serve the Traffic Cop script to browsers with DNT enabled.

I found that I had to create a new private window for each request rather than refresh in a single one. Apparently there's a cookie set on the first request which has to be removed somehow (clearing data would also work).
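
As a rough sanity check on why roughly 50 misses in a row points to something other than bad luck (such as the cookie or DNT behaviour described above): assuming each attempt were an independent fresh visit with a combined 18% chance of enrollment (9% + 9%), the probability of never seeing a funnelcake URL would be tiny. The function name below is illustrative.

```python
def prob_no_enrollment(attempts: int, total_rate: float) -> float:
    """Chance that none of `attempts` independent visits is enrolled."""
    return (1 - total_rate) ** attempts


# 9% to funnelcake 138 plus 9% to funnelcake 139.
p_miss_50 = prob_no_enrollment(attempts=50, total_rate=0.18)
print(f"P(no funnelcake in 50 fresh visits) = {p_miss_50:.1e}")
```

The independence assumption is exactly what fails when refreshing in a single private window, since the cookie set on the first request fixes the outcome for the requests that follow.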

Craig enabled Traffic Cop in prod approximately Tuesday, 15 January 2019 00:00 UTC. I see it working there.

How's that going compared to the plan, are we getting sufficient enrolments ? Firefox 65.0 will ship on January 29 and it would be good to know if we want to wrap up distribution before then, or if the release is an opportunity when download traffic is higher (or different type of user, etc). There's a Releng and informal-QA cost to using 65.0 based builds - some checks around installer changes, and that distribution still works OK after bouncer changes.

Flags: needinfo?(shong)

Our target was 150k profiles enrolled over 14 days.
If I read Su's graphs right, between Jan 15th and Jan 20th (6 days) we enrolled 42,500 profiles, i.e. about 7,083 daily.
If we keep this same pace over the next 9 days (making January 28th the last enrollment day) we'd get roughly another 63,700 profiles, i.e. about 106,200 profiles total (roughly 30% short of our enrollment target).

Is it an option to increase the enrollment rate (currently 9%) to avoid releng/QA costs related to using 65.0 based builds?
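
The projection above is a straight-line extrapolation and can be reproduced with a few lines. This is a sketch that assumes the daily enrollment pace stays constant; the function name is illustrative. The exact arithmetic gives 63,750 additional profiles and 106,250 in total, which the comment rounds down slightly.

```python
def project_enrollment(enrolled: int, days_elapsed: int,
                       days_remaining: int, target: int):
    """Linear projection of total enrollment at a constant daily pace."""
    daily = enrolled / days_elapsed
    projected_total = enrolled + daily * days_remaining
    shortfall = 1 - projected_total / target  # fraction of the target missed
    return daily, projected_total, shortfall


daily, projected_total, shortfall = project_enrollment(
    enrolled=42_500, days_elapsed=6, days_remaining=9, target=150_000
)
print(f"{daily:.0f} per day, {projected_total:.0f} projected, "
      f"{shortfall:.0%} short of target")
```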

Hmm, I read the first graph as cohort size of ~8K each on Jan 20, now ~12k on Jan 21 data, which is a lot further away. Seems like an adjustment to traffic cop would be needed either way.

+1 to Romain's comment. We're enrolling (or having fewer enrollees end up in Telemetry) at a lower rate than we projected.

Currently, we have about 50K total enrollees.

We might want to bump up enrollment (maybe 14% each branch, eyeball projection here) to achieve our desired enrollment size.

- Su
Flags: needinfo?(shong)

(In reply to Su-Young Hong from comment #54)

We might want to bump up enrollment (maybe 14% each branch, eyeball projection here) to achieve our desired enrollment size.

I can change the sample rate. Do you want to go with 14% or do you need to crunch some numbers to be more precise?

I just crunched some numbers, we're at about 50K total. Assuming 7K a day, that'll give us another 49K by the 29th. Since we're targeting about 150K, I think we'll need to double our current rate to hit that.

If nobody has any objections, let's go with 18% for each branch (18% to control, 18% to variant).

- Su
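
Su's "double the rate" estimate can be written out as a small calculation. It is a sketch under the assumption that enrollment scales linearly with the sample rate, and it takes the 7 remaining days implied by the 49K figure (7K a day); the function name is illustrative.

```python
def required_rate(current_rate: float, enrolled: int, target: int,
                  daily_at_current_rate: float, days_remaining: int) -> float:
    """Per-branch sample rate needed to reach `target` in the time left.

    Assumption: daily enrollment is proportional to the sample rate.
    """
    still_needed = target - enrolled
    gain_at_current_rate = daily_at_current_rate * days_remaining
    return current_rate * still_needed / gain_at_current_rate


new_rate = required_rate(current_rate=0.09, enrolled=50_000, target=150_000,
                         daily_at_current_rate=7_000, days_remaining=7)
print(f"needed per-branch rate: {new_rate:.1%}")
```

The result lands just above 18%, which is why doubling the 9% rate was a reasonable round number to pick.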

Agreed let's do that quickly so we avoid the releng/QA complexity

The rate increase is active in production now. You should probably make a note in your analytics of the date it increased. We can always raise it again if it looks like we still won't hit 150k before Fx65 ships.

Looks like we've shipped 65.0, but are still distributing 64.0.2 based funnelcake to 18%+18% of requests ? I don't see a recent commit in bedrock anyway.

If it turns out we don't have a large enough cohort we could always verify the 65.0 based builds and re-enable.

Flags: needinfo?(craigcook.bugz)

(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #61)

Looks like we've shipped 65.0, but are still distributing 64.0.2 based funnelcake to 18%+18% of requests ? I don't see a recent commit in bedrock anyway.

The code is still in bedrock, but controlled by a switch which is currently set to "OFF" so the experiment is suspended. See https://github.com/mozmeao/www-config/commit/31e8df10d47c3e1d5fdd87af67b198430f56caf6

If it turns out we don't have a large enough cohort we could always verify the 65.0 based builds and re-enable.

Su said we had about 136,600 at the time the experiment was halted, so a bit short of the 150k target but maybe it's enough? If you want to produce 65.0 builds, just say the word and we can flip the switch back on. I imagine it wouldn't take long at all to hit 150k total at the 18%/18% sample rate.

Flags: needinfo?(craigcook.bugz)

Great to hear the suspension already happened, thanks.

I will wait to hear if we want to gather more users. The builds were already created by automation, but I'd need to check for any installer changes in 65.0, and update bouncer to point to them.

Hey, sorry for late comment, but yes, wanted to confirm what Craig said.

To update, it looks like we have about 157,534 enrollees reporting telemetry so we have what we need r.e. population size.

Thanks all for all the hard work on this!

Experiment concluded with enrollees reporting telemetry. Seems all good, closing.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED