Closed Bug 1362377 Opened 7 years ago Closed 7 years ago

omni.ja vs ts_paint trade off

Categories

(Firefox Build System :: General, defect)

defect
Not set
normal

Tracking

(Performance Impact:high, firefox55 fixed)

RESOLVED FIXED
mozilla55
Performance Impact high
Tracking Status
firefox55 --- fixed

People

(Reporter: jpr, Assigned: catlee)

References

(Blocks 1 open bug)

Details

(Whiteboard: [MemShrink])

Attachments

(1 file)

Bug 1340157 landed Feb 17 and caused a 2-3% regression in start up time as seen in 1340873.

I'm not sure how much we win on the download size, but given that start up time (at least user perception of it) is an important focus for Quantum, we should be clear about the tradeoff we are making.
So the end result of that bug is that we left omni.ja compressed, which un-intuitively increases the download size.

In https://bugzilla.mozilla.org/show_bug.cgi?id=1340157#c6 I summarized the talos / telemetry results. It appeared as though the ts_paint regression wasn't corroborated by telemetry, and the advice at the time was that telemetry numbers are a better indicator of the user's experience.

Other impacts of the uncompressed omni.ja files are that there is a small increase to memory usage.

I'd be happy to re-revert this patch and get much smaller downloads, since that's what I was initially after!
To summarize some of the information from other bugs:

Using uncompressed omni.ja files significantly reduces download size for partial updates and initial install.
It also seems to improve startup time as measured by talos' ts_paint, but that is harder to verify with telemetry.
It also increases memory usage by ~8MB

Right now we are using compressed omni.ja files everywhere.

More background:

* omni.ja files were originally compressed, and were always compressed on aurora, beta and release channels.

* bug 1231379 landed in 2015-12-17 to disable compression on all desktop platforms.
This change never rode the trains past nightly. I estimated we would see an improvement of ~15% for Windows installer size, and up to 40% improvement in partial update size.

* bug 1233214 indicated some talos regressions / improvements.
We saw an improvement in ts_paint, and a regression in memory usage
No clear resolution on the impact, and telemetry didn't seem to confirm the talos data

* I wanted to be be able to make a decision about whether to let this ride the trains past nightly, or to back out the change from nightly. Unfortunately, telemetry and talos data was no longer available for the original change, so I filed bug 1340157 to track the impact of re-enabling compression for omni.ja

* bug 1340157 landed on 2017-02-16 and re-enabled compression for omni.ja on nightly

* bug 1340873 indicated the new set of talos regressions/improvements

* I summed up the talos and telemetry data here: https://bugzilla.mozilla.org/show_bug.cgi?id=1340157#c6
Based on that data I decided to leave the omni.ja files compressed.
At the time it appeared as though the telemetry data contradicted the talos data. Looking back now, that trend isn't clear any more to me.
Is the Talos number warm-start or cold-start number? If I remember the past corrctly, the tradeoff about omnijar compression has been about cold I/O: compressing makes cold start faster because cold start is highly gates on I/O. It makes warm start slower because CPU becomes more important and we can't do direct memory mapping of omnijar.

Telemetry for startup time is *very* difficult to model, because it aggregates both warm and cold and other starts all together, so analyzing it for small signal is hard. I actually trust our talos numbers rather more strongly. The links you provide at https://bugzilla.mozilla.org/show_bug.cgi?id=1340157#c6 show the 95th percentile, which would in general be the much higher cold-start times. I'm not sure I see any improvement at all, but look at the same graphs with 25/50/75/95 percentile charted:

session restored: https://mzl.la/2pgV5vX
firstpaint: https://mzl.la/2pICQm2
Would it be helpful to have Dominik repeat the warm and cold startup time metrics documented at https://docs.google.com/a/mozilla.com/document/d/1jyZ-GCQwR2EGYsbb_fTADYnpFJyxrmsRjRpUvf6GBAw/edit?usp=sharing against equivalent builds with compression on and off, to see more precisely how it affects cold versus warm?
Flags: needinfo?(catlee)
talos is warm start as we create a profile warm up it on a first launch, then start measuring the startup time for 20 iterations.  We do have a session restore test that doesn't warmup the profile.

In addition, we do not do a cold start of the OS, so that plays into real cold start times.
See also bug 1352595 where glandium is looking to use brotli for compression inside omni.ja. This should reduce omni.ja size *and* CPU usage for decompression.

catlee also suggested a good idea in that bug, which would be to put all omni.ja files accessed during startup in a single compression context that was bulk decompressed on startup. That will result in a smaller omni.ja *and* faster decompression since decompressors get much faster when you feed them chunks larger than a few dozen kb. The idea could be extended to related files. e.g. you could group sync files, devtools files, etc. Of course, this would require a refactoring to omni.ja reading since there would no longer be a single index (the jar file itself).
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> Would it be helpful to have Dominik repeat the warm and cold startup time
> metrics documented at
> https://docs.google.com/a/mozilla.com/document/d/1jyZ-
> GCQwR2EGYsbb_fTADYnpFJyxrmsRjRpUvf6GBAw/edit?usp=sharing against equivalent
> builds with compression on and off, to see more precisely how it affects
> cold versus warm?

Yes, I've already asked him to test out 
https://archive.mozilla.org/pub/firefox/nightly/2017/05/2017-05-03-03-02-12-mozilla-central/firefox-55.0a1.en-US.win64.zip
vs
https://www.dropbox.com/s/xwcv4swmwgv33th/firefox-55.0a1.en-US.win64-new.zip?dl=0

They should be identical except for omni.ja compression.
Flags: needinfo?(catlee)
Setting a flag to track next steps.
Flags: needinfo?(dstrohmeier)
(In reply to Chris AtLee [:catlee] from comment #7)
> Yes, I've already asked him to test out 
> https://archive.mozilla.org/pub/firefox/nightly/2017/05/2017-05-03-03-02-12-
> mozilla-central/firefox-55.0a1.en-US.win64.zip
> vs
> https://www.dropbox.com/s/xwcv4swmwgv33th/firefox-55.0a1.en-US.win64-new.
> zip?dl=0
> 
> They should be identical except for omni.ja compression.

Here's the outcome of comparing these builds:
https://metrics.mozilla.com/protected/shiny/dstrohmeier/omnija/

Results show that the build without omni.ja compression is faster for both First Paint (64ms in median) and Hero Element (161ms in median). First Paint is application appears on fullscreen, Hero Element is the "search" placeholder in the search box of the content.

In addition, the test runs for the build without compression have a much lower standard deviation resulting in a more streamlined experience for users when doing multiple start ups.
Flags: needinfo?(dstrohmeier)
I re-ran the comparison again on today's Nightly as the previous numbers were off to what we had measured last week for Nightly. There is also drop for ts_paint on May 4, 2017:
https://archive.mozilla.org/pub/firefox/nightly/2017/05/2017-05-08-03-02-04-mozilla-central/firefox-55.0a1.en-US.win64.zip vs https://www.dropbox.com/s/xwcv4swmwgv33th/firefox-55.0a1.en-US.win64-new-20170508.zip?dl=0

Based on a comparison of just 5 runs, the results for these two builds are the same. In median, both reach First Paint at 1133ms and Hero Element at 1650ms.
Don't forget about the PGO specific optimizations that are done on omni.ja. We probably should be comparing ts_paint on PGO builds only.
Can we see the results of an inference test between these two builds, not just links to boxplots? We should track the significance of the difference rather than eyeballing charts for these types of tests.  I expect that the conclusions will be the same, but we need to keep the evidence of testing observable within the context of this bug.  Reporting the inference test used, the difference in median, and the p-value of the difference should be sufficient.
Flags: needinfo?(dstrohmeier)
As outlined above, the current testing is done based on 5 runs per build. Find the data here: https://docs.google.com/a/mozilla.com/spreadsheets/d/1LW3Skd2RhTTtypV5_rIv09jAnU5iu7UUHZ-c-3our_U/edit?usp=sharing. I think it's pretty obvious that the numbers are comparable. The ones in the chart are outdated as per comment 10.

I understand that this is far off from being scientific, but given the time constraints and running these tests manually, this is what we currently can afford to do.

Once we get into the range of running these tests 100 times (or even in an automated way allowing us to collect even larger sets of data), it will definitely make sense for us to follow the recommended procedure of reporting results.
Flags: needinfo?(dstrohmeier)
This is a public bug, and your link is behind LDAP and thus not accessible for non-Mozilla employees (and neither is that spreadsheet).  Assume it is possible that metrics.mozilla.com could disappear or that we lost that link; we would never be able to replicate your conclusion.  That is why we should record the results of the comparison in this bug; at least note the differences in medians, and ideally the significance of that difference *even if the p-value is strange due to small samples*.

At a minimum, please record the observable numbers (the boxplot cutoff values) for each group that you report in your charts here.
Flags: needinfo?(dstrohmeier)
Is there enough information in here to make a call on this? The trade-off is download size vs first paint.

I think somebody just needs to choose which one we care about more. jpr, do you want to make a call?
Flags: needinfo?(jpr)
Whiteboard: [qf] → [qf:p1]
From catlee:

just to be clear in https://bugzilla.mozilla.org/show_bug.cgi?id=1362377#c15 - this latest regression is a regression in both ts paint AND file size

if we disable compression in omni.ja again, we will decrease file size and improve ts paint times in talos anyway

So this kinda sounds like a no-brainer. We should disable compression.

catlee, are you willing to do this?
Flags: needinfo?(jpr) → needinfo?(catlee)
The bug summary should be changed because this is not a trade-off.
Note that omni.ja was uncompressed on nightly only. It has never been uncompressed on other branches. So while there may be a regression for nightly, things have never actually changed for users.

Bug 1352595 is meant to make omni.ja both smaller and faster to read.
Originally it was thought to be a tradeoff because of telemetry data, but that data ended up being neutral according to :catlee.  I suspect we should just do it unless we have any concerns about stability, and let it ride the 55 train.
(In reply to Mike Hommey [:glandium] from comment #18)
> Note that omni.ja was uncompressed on nightly only. It has never been
> uncompressed on other branches. So while there may be a regression for
> nightly, things have never actually changed for users.
> 
> Bug 1352595 is meant to make omni.ja both smaller and faster to read.

Although it makes the omni.ja smaller, it doesn't do as much to improve download sizes.

e.g. for two nightly builds, partial update sizes are:
10,114,084 original partial
10,018,540 partial between mars with omni.tar files with brotli compressed contents
 8,497,367 partial between mars with uncompressed omni.ja files

complete updates are a bit better:
54,756,449 original
52,314,206 brotli
51,196,058 uncompressed omni.ja

and finally full installers:
45,395,514 original
43,202,792 brotli
40,142,349 uncompressed omni.ja

Do we know the performance implications of the broli compressed omni.ja files yet?
Flags: needinfo?(catlee)
Perhaps we should land the decompressed omni.ja patch now, and then evaluate brotli when it's ready?
Sounds like a fine plan to me.
(In reply to Chris AtLee [:catlee] from comment #21)
> Perhaps we should land the decompressed omni.ja patch now, and then evaluate
> brotli when it's ready?

Yes.  FWIW switching to brotli or other compression schemes is totally beyond the scope of this bug.
Attachment #8868558 - Flags: review?(mconley)
Comment on attachment 8868558 [details]
Bug 1362377: Disable omni.ja compression

Let's do it!
Attachment #8868558 - Flags: review?(mconley) → review+
Comment on attachment 8868558 [details]
Bug 1362377: Disable omni.ja compression

https://reviewboard.mozilla.org/r/140168/#review143650

Whoops - should probably have done this through MozReview.
Pushed by catlee@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/05c7e4f31712
Disable omni.ja compression r=mconley
https://hg.mozilla.org/mozilla-central/rev/05c7e4f31712
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 55
Please flag bugs where you plan to increase memory with the MemShrink whiteboard tag.

Chris what kind of measurements did you do? I see a reference to ~8MB in comment 2, was that just for the parent process or is it across the board? Generally speaking an 8MB regression even in one process is enough for me to encourage a backout.
Flags: needinfo?(catlee)
Whiteboard: [qf:p1] → [qf:p1][MemShrink]
a lot of installer size improvements:
== Change summary for alert #6741 (as of May 19 2017 15:04 UTC) ==

Improvements:

  8%  installer size summary windows2012-32 opt      53,621,828.92 -> 49,236,118.42
  8%  installer size summary windowsxp opt           53,684,430.42 -> 49,296,799.25
  8%  installer size summary windows2012-32 pgo      55,071,287.67 -> 50,652,881.67
  7%  installer size summary windows2012-64 pgo      59,212,856.92 -> 54,777,425.25
  7%  installer size summary windows2012-64 opt      58,863,314.75 -> 54,476,106.17
  7%  installer size summary windows8-64 opt         58,921,135.67 -> 54,534,357.75
  6%  installer size summary linux64 opt static-analysis57,762,442.42 -> 54,464,938.92
  6%  installer size summary linux64 opt artifact    58,120,355.83 -> 54,816,022.42
  6%  installer size summary linux64 opt             58,134,365.25 -> 54,836,783.83
  6%  installer size summary linux64-stylo opt       59,805,760.83 -> 56,506,713.17
  6%  installer size summary linux32 opt             58,818,503.00 -> 55,579,628.08
  5%  installer size summary linux64 pgo             63,603,808.25 -> 60,307,555.75
  5%  installer size summary linux32 pgo             63,809,020.67 -> 60,619,178.58
  5%  installer size summary linux64 debug static-analysis70,141,543.83 -> 66,873,595.83
  2%  build times summary windows2012-32 pgo taskcluster-c4.4xlarge4,156.88 -> 4,067.01
  2%  installer size summary linux64 asan            214,984,531.00 -> 211,676,488.08

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=6741
(In reply to :Dominik Strohmeier [:dstrohmeier] from comment #13)
> As outlined above, the current testing is done based on 5 runs per build.
> Find the data here:
> https://docs.google.com/a/mozilla.com/spreadsheets/d/
> 1LW3Skd2RhTTtypV5_rIv09jAnU5iu7UUHZ-c-3our_U/edit?usp=sharing. I think it's
> pretty obvious that the numbers are comparable. The ones in the chart are
> outdated as per comment 10.

To complete, here's the data for the five runs that are also captured in the linked spreadsheet:

With omni.ja compression (first paint / Hero Element in frames at 60fps):
(68 / 99), (68 / 99), (68 / 99), (68 / 100), (68 / 100)

Without omni.ja compression:
(68 / 99), (69 / 99), (68 / 99), (68 / 100), (67 / 99)
Flags: needinfo?(dstrohmeier)
As indicated in comment 3, compression was meant to improve cold startup. Dominik, are any of the measurements you provided here for cold startup?
Flags: needinfo?(dstrohmeier)
No, I haven't done any other measurements on cold startup as this wasn't defined to be in focus for 57. However, we do have a measurement setup available at SF AllHands. Shall we run one round of measurements together? I'll only have some time to do it this week as then I'll be out for 2 months and then join a new team.
Flags: needinfo?(dstrohmeier)
Bug 1352595 - Switch to !zlib for omni.ja compression

Again regressed the improvements made by this Bug in terms of size.
Blocks: 1303172
Assignee: nobody → catlee
Component: Build Config → General
Product: Firefox → Firefox Build System
Target Milestone: Firefox 55 → mozilla55
Performance Impact: --- → P1
Whiteboard: [qf:p1][MemShrink] → [MemShrink]
You need to log in before you can comment on or make changes to this bug.