Closed Bug 1510981 Opened 7 years ago Closed 6 years ago

Data science support for bootstrap process pref flip study

Categories

(Data Science :: Experiment Collaboration, task)

Platform: x86_64 Windows
Type: task
Priority: Not set
Severity: normal
Points: 3

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: RT, Assigned: wbeard)


Details

Attachments

(1 file)

Brief description of the request: The Windows launcher process is intended to make the DLL blocklist work reliably on startup, per bug 1435780. To do that, firefox.exe is modified so that its initial invocation is a bootstrap process that creates the "real" browser process in such a way that its injection-blocking capabilities are ready before the browser's main thread ever starts. Given the wide range of potential breakages related to third-party software, manual QA cannot cover all possible scenarios, so we want to gain confidence that this change does not hurt the user experience by A/B testing it through a pref flip study, validating that there is no negative retention or engagement impact.

Link to any assets: Draft PHD (not yet submitted): https://docs.google.com/document/d/1XMxWL5XcpeJoT4oL23BNV3RiMnj-gvmyE4TbLSEKLLU/edit

Bug reference: bug 1435780

Is there a specific data scientist you would like or someone who has helped to triage this request: No one in particular, but Su or Saptarshi may be best suited here since they previously helped with similar A/B testing focused on identifying retention and engagement regressions.
Component: General → Experiment Collaboration
Romain, can this wait until 2019?
Flags: needinfo?(rtestard)
(In reply to Jess Mokrzecki [:jmok] from comment #1)
> Romain, can this wait until 2019?

Yes :) We want to run the study during the 66 release cycle (starting March 19th). I assume a study start date of March 25th (the first Monday after the 66 launch):

  • User acquisition period: 1 week
  • Data collection period: 3 weeks
  • Data analysis period: 1 week

We also need someone assigned well ahead of the study start to help make sure our approach is right and that all necessary probes are available, to avoid uplift complexity.
Flags: needinfo?(rtestard)

Update: this is now tracking Firefox 67, given delays in the feature implementation.
Updated plan:

  • User acquisition period: 1 week, starting May 20th
  • Data collection period: 3 weeks, starting May 27th
  • Data analysis period: 1 week, starting June 17th

Hi Romain,

When are you thinking that you would like to have someone start to make sure your approach is right for the probes? We don't want to assign the bug until someone will be actively working on it.

Flags: needinfo?(rtestard)

(In reply to Jess Mokrzecki [:jmok] from comment #4)
> Hi Romain,
>
> When are you thinking that you would like to have someone start to make sure your approach is right for the probes? We don't want to assign the bug until someone will be actively working on it.

67 is now on Nightly, and we need data science support now to get the Experimenter process started and to address any suggestions data science may bring while we still have the flexibility to act on them on Nightly. The main thing that comes to mind is validating that all necessary probes are available as expected, but I'm sure other things will come up through data science review.

Flags: needinfo?(rtestard)
Assignee: nobody → wbeard
Status: NEW → ASSIGNED
Points: --- → 2
Points: 2 → 3
Attached file Requesting Design Review
Attachment #9045013 - Flags: review?(tdsmith)
Comment on attachment 9045013 [details]
Requesting Design Review

Please let me know when this is ready to review :)
Attachment #9045013 - Flags: review?(tdsmith) → review-

Mea culpa; I hadn't seen that it was already in Experimenter. 🙇‍♂️

This looks good to me, but if we expect certain kinds of breakage, we might want to look for, or implement, probes that reflect that breakage more directly -- for example, crash rates or counts of TLS errors, since browser usage and retention are relatively insensitive to inconveniences.

I also noticed that the old PHD described a list of populations to examine separately; if those comparisons are important, it would be good to make sure that you're powered to detect effects in each of those populations.

It would be good to double-check with Normandy engineers that the default branch will work for this pref, since it sounds like this involves early initialization -- I'm not totally sure how it works but I know that WebRender needed a user-branch pref in order to enable itself correctly.

What is the goal of the effort the experiment is supporting?

To improve DLL injection blocking by using a stub launcher process on Windows.

Is an experiment a useful next step towards this goal?

Yes; this deploys the stub launcher to a subset of Windows users.

What is the hypothesis or research question? Are the consequences for the top-level goal clear if the hypothesis is confirmed or rejected?

The stub launcher should have no negative impact on the user experience; a detectable negative effect would require diagnosis.

Which measurements will be taken, and how do they support the hypothesis and goal? Are these measurements available in the targeted release channels? Has there been data steward review of the collection?

URI count, usage hours, and retention are the proxies for the user experience. They are available and reviewed. The launcherProcessState field in the telemetry environment landed in 66.
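
For illustration, here is a minimal sketch of how week-N retention per branch could be computed; it assumes a hypothetical per-client activity table (client_id, branch, enrollment_date, activity_date) rather than the study's actual datasets, so treat the column names as placeholders.

```python
# Minimal sketch, not the study's analysis code: week-N retention per branch
# from a hypothetical activity table with datetime columns enrollment_date
# and activity_date.
import pandas as pd

def week_n_retention(activity: pd.DataFrame, n: int = 3) -> pd.Series:
    """Fraction of enrolled clients per branch active during week n
    (days 7*n .. 7*n+6) after enrollment."""
    df = activity.copy()
    days = (df["activity_date"] - df["enrollment_date"]).dt.days
    df["in_week_n"] = days.between(7 * n, 7 * n + 6)
    # A client counts as retained if it has at least one active day in week n.
    per_client = df.groupby(["branch", "client_id"])["in_week_n"].any()
    return per_client.groupby(level="branch").mean()
```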

A couple comments:

  • You could throw in active ticks. The more the merrier; you'll pull it for core product metrics anyway, and it could help illuminate any change in usage hours.
  • You could ask the data pipeline team to aggregate launcherProcessState into main_summary if you like.

Is the experiment design supported by an analysis plan? Is it adequate to answer the experimental questions?

Uh, I assert "yes, and yes" based on the existence of a power analysis; the analysis seems straightforward. How do you plan to aggregate the usage metrics? Will you compute per-user sums of usage hours and URIs over the course of the experiment?
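
In case it helps frame the question, here is a rough sketch of one way the per-user sums could be pulled from main_summary over the collection window; the experiment slug and the branch extraction are my assumptions, not taken from the study.

```python
# Rough sketch, not the experiment's actual query: per-client totals of usage
# hours, URI counts, and active ticks over the collection window
# (May 27th - June 16th per the updated plan).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

ms = spark.table("main_summary").where(
    "submission_date_s3 BETWEEN '20190527' AND '20190616'"
)

per_client = (
    # "launcher-process-study" is a placeholder slug for the experiments map.
    ms.withColumn("branch", F.col("experiments")["launcher-process-study"])
      .where(F.col("branch").isNotNull())
      .groupBy("client_id", "branch")
      .agg(
          (F.sum("subsession_length") / 3600.0).alias("usage_hours"),  # seconds -> hours
          F.sum("scalar_parent_browser_engagement_total_uri_count").alias("total_uris"),
          F.sum("active_ticks").alias("active_ticks"),
      )
)
```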

Is the requested sample size supported by a power analysis that includes the core product metrics?

Yes. A 2% change in retention seems subjectively large to me; ¯\_(ツ)_/¯. If Romain's happy, I'm happy :)
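
For context, a back-of-the-envelope sketch of the kind of power calculation involved; this is not the study's actual power analysis, and both the 50% baseline retention rate and the reading of "2%" as an absolute (percentage-point) difference are assumptions.

```python
# Back-of-the-envelope check: clients needed per branch to detect an assumed
# absolute 2-percentage-point drop in retention at 80% power, alpha = 0.05.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.50                # assumed control retention rate (placeholder)
treatment = baseline - 0.02
effect = proportion_effectsize(baseline, treatment)

n_per_branch = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(round(n_per_branch))     # required clients in each branch
```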

Report is here.

No significant regressions were observed.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
