Closed Bug 928017 Opened 11 years ago Closed 11 years ago

funnelcake build to test different sizes of Firefox exe

Categories: Release Engineering :: General, defect
Severity: normal
Tracking: Not tracked
Status: RESOLVED DUPLICATE of bug 933847
People: Reporter: aphadke, Assigned: nthomas
https://bugzilla.mozilla.org/show_bug.cgi?id=853301#c2 mentions that installers and partials for Firefox 27 will be 10% larger than usual.
Brendan Colloran suggested that we can perform an A/B test to understand the impact of increased EXE size on install ratios.

From Brendan:
Would it be possible to do a more apples-to-apples experiment by serving a version of FF24 that just has an X MB extra file of junk? (A big uncompressed bitmap of white noise or something?)

From bsmedberg:
I am very excited about this idea. I believe that adding extra junk data to the installer is a pretty trivial thing to do. What is the cost and possible timeframe of running a funnelcake experiment on FF24 downloads like this? Do we actually have to have different builds of Firefox in order to measure the results properly?

Nick - thoughts?
Assignee: nobody → nthomas
FWIW:

Also, if we do this for a few values of X, we could even then run a regression to assess the impact of each additional MB of installer size on successful installation, which imo would give us a lot more decision-making ability in the future. (It might help us quantify the trade-off: "is feature Y, which adds 1MB to the installer, worth an estimated 2% decrease in installation success?")
We should do Funnelcake tests on a regular basis, such as every 3 or 6 months or even every release.
OS: Mac OS X → All
Hardware: x86 → All
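A minimal sketch of the regression bcolloran proposes above, in Python; the installer sizes and success rates below are made-up placeholders standing in for real funnelcake buckets, not actual data:

# Hypothetical numbers: fraction of downloads reaching firstrun for each
# installer size bucket (control plus padded builds).
import numpy as np

installer_mb = np.array([22.0, 26.0, 30.0, 35.0, 40.0])
success_rate = np.array([0.61, 0.59, 0.56, 0.52, 0.47])

# Ordinary least squares fit: success_rate ~ intercept + slope * installer_mb
slope, intercept = np.polyfit(installer_mb, success_rate, deg=1)
print("estimated change in install success per extra MB: %.4f" % slope)
print("so a 1 MB feature costs roughly a %.1f%% drop in installs" % (-slope * 100))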
A big +1 to learning more from this test, as we'd like to make sure we fully understand the cost increasing Firefox install size has on installation/usage success rates overall, and especially in places like key emerging markets where bandwidth is extremely challenging.
There's a separate discussion happening on what the methodology and parameters of this study would be (eg # of different sizes, platforms/locales, timing). Once that is nailed down funnelcake builds can be created. 

In the meantime we can investigate how to pad the size, eg 
* include some junk file that compresses to just the right size
* append to the exe/dmg/tar.bz2 between compression and signing (probably asking for trouble)
I think we should just use a file of random bytes, which won't compress:
$ dd if=/dev/urandom of=junk bs=1024 count=1024
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB) copied, 0.0656007 s, 16.0 MB/s
$ ls -lh junk
-rw-rw-r-- 1 luser luser 1.0M Oct 24 09:11 junk
$ bzip2 junk
$ ls -lh junk.bz2 
-rw-rw-r-- 1 luser luser 1.1M Oct 24 09:11 junk.bz2

We could check said file into the tree (which sucks a little bit) or write a small script to generate a file using a PRNG and a given seed (to produce consistent output).
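A sketch of such a script (the file name, seed, and command-line interface here are just placeholders):

#!/usr/bin/env python
# make_junk.py - write size_mb megabytes of seeded pseudo-random bytes.
# The same seed and size always produce the same file, so nothing needs to be
# checked into the tree, and the content stays effectively incompressible.
import random
import sys

def write_junk(path, size_mb, seed=928017):
    rng = random.Random(seed)
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(bytearray(rng.getrandbits(8) for _ in range(1024 * 1024)))

if __name__ == "__main__":
    # e.g. python make_junk.py junkfile 5  ->  5 MB of reproducible junk
    write_junk(sys.argv[1], int(sys.argv[2]))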
For the funnelcake Windows installer, the junk file could be placed alongside setup.exe inside the self-extracting archive. This will prevent the junk file from being installed.

For example:
Firefox Setup XX.X.exe
  core\
  junkfile
  setup.exe

For the Windows portion of the study, the stub data will provide us with much more accurate data about the process, for example the amount downloaded for cancelled installs as well as the time spent downloading.
This funnelcake study was prompted by questions about the installer size increase caused by ICU bug 853301.
Blocks: 853301
I agree with bcolloran that it would be optimal to test a few different installer sizes so we can make predictions about the effect of increasing installer size.

To make this experiment simple and clean, I recommend testing on the same Firefox build with varying levels of installer size (funnelcake pushed out at the same time, tracked the same length of time). We also want to be able to identify different locations so that we can observe the effect of different installer sizes on emerging and non-emerging markets.

The data should contain information such as: (Geo, Size, Download, First Run Page)

To make sure that our model is consistent over time, we can run the funnelcake study once in a while (every 6-12 months?).
In email, cmore said:

> Before we can do a single funnelcake test or a real controlled experiment for
> the entire download-to-install experience, we need the instrumentation in
> place to be able to make this scientific. The current funnelcake builds add
> ?f=[funnelcakeID] to the /firefox/firstrun/ and /firefox/whatsnew/ urls that
> help us understand the tail end of the install process. What is missing now
> is that if and when the stub installer falls back to the
> /firefox/installer-help/ page, it does not pass along any ?f=[funnelcakeID]
> to the page. Those query parameters allow us to capture the data with Google
> Analytics and also vary the content and download links based on the
> funnelcake ID.
(In reply to Chris Peterson (:cpeterson) from comment #9)
> In email, cmore said:
> 
> > .... What is missing now
> > is that if and when the stub installer falls back to the
> > /firefox/installer-help/ page, it does not pass along any ?f=[funnelcakeID]
> > to the page. Those query parameters allow us to capture the data with Google
> > Analytics and also vary the content and download links based on the
> > funnelcake ID.

If that applied to the most recent funnelcake (bug 892848), then I'm surprised. I made this change to the stub installer code before building:

-!define URLManualDownload "https://www.mozilla.org/${AB_CD}/firefox/installer-help/?channel=releaseinstaller_lang=${AB_CD}"
+!define URLManualDownload "https://www.mozilla.org/${AB_CD}/firefox/installer-help/?channel=release&installer_lang=${AB_CD}&f=23"
I am concerned about the idea of doing this testing regularly.

1) Spinning up and running funnelcake A/B testing is a significant effort in IT, Metrics, and RelEng time.

2) Last time we did this, we were also hesitant to go full blast with funnelcake builds because there was still some concern that this system has a very small chance to isolate users on those builds, and make them unable to upgrade. Consequently we chose to do this only for a very small fraction of our user base, and only for as long as we had to. I don't think we can currently say with authority that "funnelcake builds are no more likely to get stuck on old versions than regular builds". My info might be out of date on this, though.

3) If we want to start doing regular funnelcake A/B testing, is this the best test to start with? Is there nothing we might rather measure on a regular basis, say for example something directly related to stub installer efficacy, or automatic update failures? Additionally, would we want to test *only* this thing? It might be worth testing several variables all at once (independently of course... some users test variable X, some users test unrelated variable Y, some users test nothing). If we're going to spend the time on funnelcake on a regular basis, we might as well get the maximum value out of it... there is some economy of scale (not a lot, but some).


If we really want to do this more than once, I'm going to be a stick in the mud and insist that we streamline the process before I commit to supporting that on the IT (product delivery) side. For my part that primarily means some additional automation between RelEng and Bouncer.

I suspect someone like akeybl or johnath might be a similar stick in the mud with respect to guaranteeing that we're not risking problems with our user base. Still others might want to throw their particular measurement scenario into the hat if we're going to start doing regular funnelcakes.
Jake: those are fair concerns. I see how one-off A/B tests are a big disruption and could risk breaking Firefox updates.

I agree that regular tracking of metrics like stub installer efficacy or automatic update failures would be useful. With that data, we might be able to infer correlations such as increased update failures after a particular installer size increase on day X (without running an A/B comparison with an inflated installer).
I am planning on making the repackaging process simpler for Windows, since on Windows this is possible via the stub installer / installer. Other platforms will take approximately the same effort.

As for the concern about abandoning users: the issue present last year, where the Firefox distributions code did not apply preferences correctly, has been worked around in app update, and QA has verified that it no longer occurs. Note: bug 802022 covers the bug in the Firefox distributions code.

Simplification and automation around the bouncer changes needed would be a very good thing.

From the Windows side, the stub installer data ping provides many more data points that help us understand the download and install process, along with much more accurate data than funnelcake, which is skewed by bots performing downloads that never lead to an install, as well as other factors such as installs of the same version not opening the first run page. We also get a much larger sample than we do with funnelcake, and we gather this data continuously. The main drawbacks are that it takes more effort to understand what the data actually means, and that it is different from the funnelcake data available for other OS's, so a comparison between Windows stub data and Mac funnelcake data, for instance, is not of much value.

Note: we have telemetry probes for update result codes.
(In reply to Robert Strong [:rstrong] (do not email) from comment #13)
> From the Windows side the stub installer data ping ...

Robert: How long is the stub installer data you describe in bug 811573 comment 50 archived? Can someone on the Metrics (?) team review that data to see if there is any increase in Exit_Code 10 (download cancelled by the user), Exit_Code 11 (too many attempts to download), Download_Phase_Seconds, or Last_Download_Seconds starting with Nightly 26.0a1 build 2013-08-14?

The presence or lack of any correlation might not answer our original question, but it would give us some preliminary insights. :)
Flags: needinfo?(robert.bugzilla)
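A rough sketch of the before/after comparison cpeterson asks for, assuming the ping fields from bug 811573 comment 50 have been exported to a flat file (the file name and date column here are hypothetical):

import pandas as pd

pings = pd.read_csv("stub_pings.csv", parse_dates=["Date"])  # hypothetical export

# Exit_Code 10 = download cancelled by the user, 11 = too many download attempts
pings["cancelled"] = pings["Exit_Code"].isin([10, 11])

cutoff = pd.Timestamp("2013-08-14")   # first Nightly with the larger installer
before = pings[pings["Date"] < cutoff]
after = pings[pings["Date"] >= cutoff]

for label, frame in (("before", before), ("after", after)):
    print(label,
          "cancel rate %.3f" % frame["cancelled"].mean(),
          "median download secs %.1f" % frame["Download_Phase_Seconds"].median())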
Chris, I'm not sure, but Anurag should be able to, so needinfo'ing him.
Flags: needinfo?(robert.bugzilla) → needinfo?(aphadke)
Chris, I suspect that the current data set that I used to analyze the stub would be sufficient. It covers 2013-07-08 through 2013-08-25.
I also have 2013-10-01 through 2013-10-22
chris - we have data from 6/28 - 10/23 for download-stats.mozilla.org. Please file a bug to get ssh access to peach-gw.peach.metrics.scl3.mozilla.com (metrics component) and I can walk you through the steps to get the data.
Flags: needinfo?(aphadke)
Assuming that the stub installer was rolled out to localized builds, is it possible to leverage & access this data to identify a bias toward failure of big installers by country? This would help us identify whether big installers are negatively impacting or failing in regions we suspect of having poor internet connectivity.
I believe so. When I last looked at the data a couple of months ago, Indonesia (id locale) showed a much larger cancellation rate during the download phase, and successful installs also took quite a lot longer than in other locales. Since these are server logs, it would also be possible to use the IP address to figure out geolocation, though with the above I don't think it is necessary to do so.
and yes, your assumption that the stub installer was rolled out to localized builds is correct.
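A sketch of that per-locale breakdown, on the same hypothetical export of the stub ping data (the "Locale" column name is an assumption; Exit_Code and Download_Phase_Seconds are the fields discussed above):

import pandas as pd

pings = pd.read_csv("stub_pings.csv")
pings["cancelled_during_download"] = pings["Exit_Code"] == 10

by_locale = pings.groupby("Locale").agg(
    downloads=("Exit_Code", "size"),
    cancel_rate=("cancelled_during_download", "mean"),
    median_download_s=("Download_Phase_Seconds", "median"),
)

# Locales with the worst download-phase cancellation rates float to the top
print(by_locale.sort_values("cancel_rate", ascending=False).head(10))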
(In reply to Robert Strong [:rstrong] (do not email) from comment #20)
> I believe so, when I last looked at the data a couple of months ago
> Indonesia (id locale) showed a much larger cancellation rate during the
> download phase and successful installs also took quite a lot longer than
> other locales. Since these are server logs it would also be possible to use
> the ip address to figure out geolocation though with the above I don't think
> it is necessary to do so.

Great! How might I access this data in order to begin identifying more of these particular regions?
Anurag from metrics is the person to ask so needinfo'ing him.
Flags: needinfo?(aphadke)
[:rstrong] [:cpeterson]

Something we may want to keep in mind is that by looking at the current stub installer data, we are unable to separate the effects of a new Firefox version from the effects of installer size, since the size of the installer is aggregated over FF versions. In order for us to state the relationship between installer size and completion rate with more confidence, we would need to keep the FF version consistent while varying only the junk size x from 0 to n.
Though it is extra work, the version can be determined from the date / time the download started, since it is always the latest version as of that date / time, and the size can then be determined from the version. Also, for completed downloads the size is in the ping data.
With the above in mind, you can determine the size of cancelled downloads by the size of successful downloads that started around the same time.

BTW: for the junkfile case I agree that it should be much more controlled as to which package is being downloaded, and we can differentiate that when building the custom stub for funnelcake.
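A minimal sketch of mapping a download start time to the version being served, per the suggestion above (the release dates listed are placeholders, not an authoritative calendar):

import bisect
from datetime import datetime

# (release date, version) pairs in chronological order -- placeholder values
releases = [
    (datetime(2013, 8, 6), "23.0"),
    (datetime(2013, 9, 17), "24.0"),
    (datetime(2013, 10, 29), "25.0"),
]

def version_for(download_start):
    dates = [d for d, _ in releases]
    i = bisect.bisect_right(dates, download_start) - 1
    return releases[i][1] if i >= 0 else None

print(version_for(datetime(2013, 10, 1)))   # -> "24.0"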
How big do we risk making the junkfile? How many size increments do we need to see a pattern? 30, 35, and 40 MB versus the control group?

Windows installer sizes today:

* Firefox 24 = 22 MB ---------- ---------- --
* Beta 25    = 22 MB ---------- ---------- --
* Aurora 26  = 24 MB ---------- ---------- ----
* Nightly 27 = 26 MB ---------- ---------- ------
* Chrome 30  = 35 MB ---------- ---------- ---------- -----

In a dev-planning thread [1], chofmann said that Netscape ran similar tests about a decade ago. The Netscape download was 30 MB and their logs showed a "dramatic hockey stick" of downloads dropping near the 10-15 MB point.

Can we measure download failure points without adding a junkfile? The download failure point seems more useful than a binary measurement of "download failed yes/no".

[1] https://groups.google.com/d/msg/mozilla.dev.planning/hPgUBzweL70/JkQpZiSB3DQJ
just to add a bit more on how measurement was done in the past.

the netscape installer was made up of functional components related to optional parts of the install package.

there was a stub, then the stub pulled down about 4 or 5 components, the biggest being java. the stub had UI for optionally selecting each of the basic components and the default was to select all components. As the stub completed the download of each component it would ping back for the next, allowing us to track these pings and gather stats on the progress of each install. It also allowed some insight into how many people were interested in customizing their install, and which of the components were most, and least, popular.

In this way we could provide a "developer" component, and also maybe bundle, or un-bundle, other parts of the product into the default install.

The components could also just be blobs of installer data that would also allow some level of tracking.

Also if I recall right, a pretty small pct. of failures were actually the result of user action to cancel the install process.  Most were just dropped connections.  It would be interesting to measure both again as well.
(In reply to jbeatty from comment #22)
Please file a Metrics component bug to get SSH access to peach-gw.peach.metrics.scl3.mozilla.com (https://bugzilla.mozilla.org/enter_bug.cgi?product=Mozilla%20Metrics)
We have geo data (by country and ISP) for download-stats.mozilla.org. Once you have access, I will walk you through the steps on how to get the data or run the intended queries using HIVE, it's really simple.
Flags: needinfo?(aphadke)
Filed bug https://bugzilla.mozilla.org/show_bug.cgi?id=931394 for jbeatty to get access to hive.
(In reply to chris hofmann from comment #28)
> Also if I recall right, a pretty small pct. of failures were actually the
> result of user action to cancel the install process.  Most were just dropped
> connections.  It would be interesting to measure both again as well.

Does the Firefox stub installer report downloads where a frustrated user just gave up without completing the installation?
Yes it does, though it obviously is unable to determine that the user was frustrated, which is usually due to being unable to download over a flaky connection.

There is also bug 860873 so we can gather data as to why the user cancelled.
(In reply to Chris Peterson (:cpeterson) from comment #27)
> How big to we risk making the junkfile? How many size increments do we need
> to see a pattern? 30, 35, and 40 MB versus the control group?
> 
> Windows installer sizes today:
> 
> * Firefox 24 = 22 MB ---------- ---------- --
> * Beta 25    = 22 MB ---------- ---------- --
> * Aurora 26  = 24 MB ---------- ---------- ----
> * Nightly 27 = 26 MB ---------- ---------- ------
> * Chrome 30  = 35 MB ---------- ---------- ---------- -----
> 
> In a dev-planning thread [1], chofmann said that Netscape ran similar tests
> about a decade ago. The Netscape download was 30 MB and their logs showed a
> "dramatic hockey stick" of downloads dropping near the 10-15 MB point.
> 
> Can we measure download failure points without adding a junkfile? The
> download failure point seems more useful than a binary measurement of
> "download failed yes/no".
The stub installer tries its little heart out to successfully download, so what you will see are people cancelling the download, and the stub installer ping includes the amount downloaded.

The data points are available in bug 811573 comment #50

also be aware that the ordering of the values was updated in
bug 811573 comment #65
(In reply to chris hofmann from comment #28)
>...
> Also if I recall right, a pretty small pct. of failures were actually the
> result of user action to cancel the install process.  Most were just dropped
> connections.  It would be interesting to measure both again as well.
On a dropped connection the new stub attempts to re-download, so it is a comparison of apples and oranges, similar to how the funnelcake data doesn't report success accurately, as noted in comment #13.
Hey Nick or Anurag - When will this test go live?
Still trying to figure out the logistics, will update the bug soon.
(In reply to Nick Thomas [:nthomas] from comment #10) 

> If that applied to the most recent funnelcake (bug 892848), then I'm
> surprised. I made this change to the stub installer code before building:
> 
> -!define URLManualDownload
> "https://www.mozilla.org/${AB_CD}/firefox/installer-help/
> ?channel=releaseinstaller_lang=${AB_CD}"
> +!define URLManualDownload
> "https://www.mozilla.org/${AB_CD}/firefox/installer-help/
> ?channel=release&installer_lang=${AB_CD}&f=23"

Ah ha! Awesome, I didn't know you had added that before. I confirmed this to be the same with Google Analytics for funnelcakes 24 and 25: I saw about 4.1% of people who downloaded those funnelcakes fall back to the installer-help page. That means we can now add logic to the installer-help page to do something different with the download button on that page when the f parameter is in the URL.
This moved to bug 933847 + depends.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → DUPLICATE