Add talos regression test for "cold" startups

RESOLVED FIXED

Status

()

defect
P1
normal
RESOLVED FIXED
6 years ago
4 years ago

People

(Reporter: vladan, Assigned: jmaher)

Tracking

(Blocks 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 3 obsolete attachments)

We need a cold startup benchmark that takes into account I/O read costs.

Talos ts_paint only measures hot startups, i.e. the time to re-open the browser after most of its code and data have already been cached in memory.

Variations in I/O op times and the effects of OS pre-fetching will have to be accounted for.
Blocks: start-faster
this will be tricky with talos due to the restarting and repeated runs of this. we might have to trick buildbot/mozharness somehow if we are rebooting- not a trivial task.
Assignee: nobody → rvitillo
Instead of rebooting, we could use the NtSetSystemInformation API on Windows to clear the Standby List (see http://processhacker.sourceforge.net/doc/memlists_8c_source.html), i.e. flushing the page cache. We could do something similar on OSX and Linux.
Here are some numbers for firstPaint under various scenarios. To empty the file cache I used NtSetSystemInformation on Windows, purge on OSX and “sync && echo 3 > /proc/sys/vm/drop_caches” on Linux.

Windows (reboot):
mean: 1152
sd: 48

Windows (cache cleared):
mean: 1228
sd: 22

Windows (hot):
mean: 765
sd: 8

Linux (reboot):
mean: 859
sd: 22

Linux (cache cleared):
mean: 961
sd: 27

Linux (hot):
mean: 678
sd: 12

OSX (reboot):
mean: 1835
sd: 18

OSX (cache cleared):
mean: 2570
sd: 23

OSX (hot)
mean: 1737
sd: 13

The following observation can be made (measurements come from different machines):
- dropping the cache increases the average startup time on all OSs since also libraries that are loaded at startup by the system are purged.
-  Even though the standard deviation is slightly different under the various scenarios, the difference in the mean is statistically significant.

If you think that the numbers are acceptable, we can proceed to encapsulate the logic in a python module so that it can be used by talos.
Flags: needinfo?(vdjeric)
Flags: needinfo?(jmaher)
for windows and linux the reboot vs cache cleared are similar, but osx is way off.  Also the stddev (I assume that is sd) seems much higher, I assume that translates into more noise?

It would be interesting to see this run on our automation machines.
Flags: needinfo?(jmaher)
Joel, looking at reboot vs hot on OSX, it would seem that OSX preloads at startup quite many libraries. In any case the purge utility, which comes with the OS, seems to be the correct way to flush the file cache. Do you have something better in mind?

The difference in standard deviations for OSX are not particularly "big" imho, at least in statistical terms. An F-test fails to reject the hypothesis that the standard deviations are equal with alpha=0.05. On the other hand, a small increase in noise might be a reasonable tradeoff if we can account also for IO timings.
Flags: needinfo?(jmaher)
So my concern is that we are showing a real cold startup and measuring it, but will users ever experience this?  I understand we won't get 100% accurate, this seems a bit distorted- although it is better than hot startup only.
Flags: needinfo?(jmaher)
Whiteboard: [good first bug][mentor=rvitillo][lang=python]
This is a great good_first_bug.

To get started with talos here is a brief primer:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

To get started with Mercurial at Mozilla, here are some tips:
https://developer.mozilla.org/en-US/docs/Mercurial_FAQ

If you are considering this bug, please go ahead and start working on it.  You can ask questions in the bug or on irc in #perf.
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #3)
> If you think that the numbers are acceptable, we can proceed to encapsulate
> the logic in a python module so that it can be used by talos.

- I would have expected shorter startup times after rebooting Windows vs after flushing the cache. How long did you wait after a Windows reboot before starting Firefox? Windows Prefetch will pre-emptively read program code+data after each startup if it notices the user frequently launches the program after a reboot. 
- Does the NtSetSystemInformation call clear both standby and active pages in the disk cache?

(In reply to Joel Maher (:jmaher) from comment #6)
> So my concern is that we are showing a real cold startup and measuring it,
> but will users ever experience this?  I understand we won't get 100%
> accurate, this seems a bit distorted- although it is better than hot startup
> only.

I don't think users would ever see Firefox startups where their disk cache is completely empty, but they would experience similarly slow Firefox startups when Firefox code+data isn't cached, when the disk I/O bandwidth is saturated, or the system is under memory pressure and swapping. Note also that our existing hot startup test doesn't occur often in the wild either.

This cold test is useful because it assigns a cost to all the I/O needed to start Firefox. The absolute numbers won't necessarily be meaningful, but it should be able to catch real regressions/improvements. Noise will be an issue because there will be variations in disk drive performance between different test machines & test runs.
Flags: needinfo?(vdjeric) → needinfo?(rvitillo)
(In reply to Vladan Djeric (:vladan) from comment #8)
> - Does the NtSetSystemInformation call clear both standby and active pages
> in the disk cache?

You can verify this by looking at the "Metafile" row in the RamMap utility: http://technet.microsoft.com/en-ca/sysinternals/ff700229.aspx
(In reply to Vladan Djeric (:vladan) from comment #8)
> (In reply to Roberto Agostino Vitillo (:rvitillo) from comment #3)
> > If you think that the numbers are acceptable, we can proceed to encapsulate
> > the logic in a python module so that it can be used by talos.
> 
> - I would have expected shorter startup times after rebooting Windows vs
> after flushing the cache. How long did you wait after a Windows reboot
> before starting Firefox?
I waited 1 minute.

> Windows Prefetch will pre-emptively read program
> code+data after each startup if it notices the user frequently launches the
> program after a reboot. 
Do you know at what point Prefetch will actually load the the program into memory? In a few trials I noticed that if I wait long enough, e.g. more than 10 minutes, then the startup time can be lower but that's not always the case.

> - Does the NtSetSystemInformation call clear both standby and active pages
> in the disk cache?
It clears only the standby pages.
Flags: needinfo?(rvitillo)
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #10)
> > - Does the NtSetSystemInformation call clear both standby and active pages
> > in the disk cache?
> It clears only the standby pages.

I think you can use NtSetSystemInformation to clear active pages as well. 
The CacheSet utility clears the active cache http://technet.microsoft.com/en-ca/sysinternals/bb897561.aspx
And apparently it uses NtSetSystemInformation as well http://www.autohotkey.com/board/topic/49114-replicating-cachesets-sysinterals-functionality-in-ahk/page-2
Joel, I uploaded a python module to pypi that allows to flush the disk cache:

To install it:
pip install cache_flusher

To use it:

import cache_flusher
cache_flusher.flush()

The flush() method required administrative privileges. Also, the python module comes with a Windows executable (see https://github.com/vitillo/Flush).

How do you suggest to proceed to add the module as dependency to talos?
Flags: needinfo?(jmaher)
we have a setup.py in talos that has dependencies: http://hg.mozilla.org/build/talos/file/b4907f0b27d3/setup.py#l13.

If we want to depend on this, we need to upload it to the internal pypi server (similar to bug 976797).  Depending on the code, etc. 

I recommend getting the module uploaded to pypi and adding a dependency to setup.py (with a specific version).  Then we can test on try and figure out what else is required (i.e. root access, windows binary compatibility, etc.).

We would also need to integrate this into talos, there are setup and cleanup functions that the test can specify:
http://hg.mozilla.org/build/talos/file/tip/talos/xperf.config#l20
Flags: needinfo?(jmaher)
Comment on attachment 8423199 [details] [diff] [review]
Add talos regression test for "cold" startups, v1

Review of attachment 8423199 [details] [diff] [review]:
-----------------------------------------------------------------

I assume this worked for you locally.  Please advise on which platform we should run this on, and I can get started playing with this in staging.
Attachment #8423199 - Flags: review?(jmaher) → review+
(In reply to Joel Maher (:jmaher) from comment #15)
> Comment on attachment 8423199 [details] [diff] [review]
> Add talos regression test for "cold" startups, v1
> 
> Review of attachment 8423199 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I assume this worked for you locally.  Please advise on which platform we
> should run this on, and I can get started playing with this in staging.

It would be great if you could test it on Windows since that is the platform I do expect more trouble.
Depends on: 1012828
Depends on: 1013158
did a few retriggers once I got the database entries:
https://tbpl.mozilla.org/?tree=Try&rev=dc4ad268fdb2

this looks to be reliable and the numbers are stable (all seem to have 1 outlier and the rest in a +- 10ms range)
a full push is up here: https://tbpl.mozilla.org/?tree=Try&rev=60bbe32332e5 (please ignore the first 4 'o' pushes on linux32)

linux32:
ts_paint: 916.42, 927.11, 899.53, 903.53
ts_cold:  881.79, 882.47, 917.42, 899.58

OSX 10.6:
ts_paint: 1075.53, 1092.05, 1118.37, 1071.37
ts_cold:  1077.68, 1086.00, 1080.74, 1075.79

linux64:
ts_paint: 826.16, 833.05, 823.11, 820.63
ts_cold:  836.74, 846.95, 847.84, 829.74

Win7:
ts_paint: 836.16, 833.68, 883,11, 834.74
ts_cold:  833.95, 834.11, 856.68, 830.47

Win8:
ts_paint: 824.58, 839.74, 839.42, 837.37
ts_cold:  862.26, 853.89, 847.00, 825.11

WinXP:
ts_paint: 694.32, 673.58, 684.00, 683.26
ts_cold:  686.63, 684.37, 687.37, 688.05

OSX 10.8:
ts_paint: 832.32, 861.58, 1447.53, 844.32
ts_cold:  11880.79, 11416.26, 14249.42, 12352.68



with the exception of osx 10.8, all the numbers are almost identical.  I am wondering if we are really running anything here, or if Firefox is just that good :)
We get all kinds of errors:
sh: 1: cannot create /proc/sys/vm/drop_caches: Permission denied (on Linux)
The system cannot find the path specified. (on Windows)
sh: purge: command not found (on OSX)

I wonder why the measurements are so different on OSX between ts_paint and ts_cold if the purge command is not execute anyway. First things first, we need to have administrative privileges to run python module, how can we do that Joel?
Flags: needinfo?(jmaher)
chatting in irc, lets focus on windows 7 first and get that going.

Roberto can you outline the specific binary that needs to run?  We can figure out where the specific location is?

We can then work with the ops team to make this work.  We solved this for xperf on win7 to have admin privileges.
Flags: needinfo?(jmaher)
The binary is located in site-packages/cache_flusher-1.0.3-py2.7.egg/cache_flusher/data/Flush.exe where the location of site-packages is platform dependent. With the last changes I made to the cache_flusher module the location shouldn't matter, we just need admin privileges.
Q: how could we get Flush.exe to run in admin mode?  I know we did some magic for xperf (which is win7 only), I would like to figure out a way to do that again (ideally all platforms, but Win7 for now)
Flags: needinfo?(q)
Q, any next steps we can take here?
Joel, My apologies this got lost in a big shuffle  I think we can use the same elevated privileges fix as  xperf. Is this only called inside the harness?
Flags: needinfo?(q)
Q- this is a binary that is installed in the virtualenv as a dependent package (python package with a binary) but we only call it from the talos harness (just like we only call xperf from the talos harness).

I do believe the same xperf stuff will work in this case.
Q- I just realized that maybe this might be done with the xperf solution?  I can churn on this quickly, just let me know when you want to test and I can retrigger jobs quickly or do it on a loaner.
Flags: needinfo?(q)
Mentor: rvitillo
Whiteboard: [good first bug][mentor=rvitillo][lang=python] → [good first bug][lang=python]
Priority: -- → P1
With the escalation of this to P1 I will jump on working this fix in today for testing.
Flags: needinfo?(q)
So I discovered today that there used to be a "ts_cold" cold startup test. Roberto, have you seen these bugs? How many of these objections are applicable to your approach?

- bug 510587: Implement Mac/Linux cold startup Talos test
- bug 522807: Implement Windows cold startup test
- bug 532965: Mac cold startup is not cold enough
- Bug 619536: Turn off ts_cold tests for all platforms
Flags: needinfo?(rvitillo)
Even though it seems that none of the previous approaches used NtSetSystemInformation, the effects might be comparable. The main concern back then was noise but we never really got to the point where we tried my python module out. On my machine the noise didn't look too bad but the only way to know for sure is to actually run ts_cold on our talos machines.

Also, in Bug 532965, valid concerns are made about the OSX purge utilty not flushing the libraries which can be circumvented by either mounting and unmounting a volume, or by requesting 2x physical memory from the virtual memory pager and then touching every page. The latter approach though being considerably slow while the former instead might not be feasible easily on talos boxes.

My suggestion here is that before we move to more complicated solutions we first check if noise is a problem with what we have right now. If libraries are not flushed and we have already too much noise, it might not be worth to continue working on this. We might as well just be happy with the number of bytes read/written at startup that xperf can give us.
Flags: needinfo?(rvitillo)
This should now be set to inherit "Highest permissions" from the initial harness launch in win 7. Is someone available to run this through on a test system and aim me a the result to make sure we are including all the sub processes in the privilege escalation ?
I have retriggered a job that will run ts_paint_cold on windows 7, lets see how this works!
I have triggered this and have a log:
https://tbpl-dev.allizom.org/php/getParsedLog.php?id=43274581&tree=Try&full=1

I am not sure how to determine if this is working or not.  Roberto, can you help me out to see if you can determine what is working/not working based on the above log?
Flags: needinfo?(rvitillo)
Joel, I can't spot the output generated by Flush.exe which should be either: "Failure to set privileges" or "Flush FileCache WorkingSet : ok". Does the log from talos include also the output generated by executables launched in a subshell through os.system?
Flags: needinfo?(rvitillo) → needinfo?(jmaher)
it doesn't, but I can try to get it printed out!
Flags: needinfo?(jmaher)
here is what I get locally while trying to print it out:
https://pastebin.mozilla.org/5533148

I am not sure we have any logging printed.
Joel, the logging varies by OS since I am using different tools; on Linux nothing is reported.
I haven't been able to get any data from it :(  I have tried printing stdout,stderr- hard to validate if it is working- tomorrow i will test it with talos on my win7 vm.
I have a windows 7 loaner machine and I have some more information on this.

While running this on the loaner, I see a popup dialog for MSVCR110.dll.  I found an easy solution to this:
http://www.microsoft.com/en-us/download/details.aspx?id=30679#

I downloaded the redistributable, installed it and flush worked as one would expect.

options:
1) get flush.exe (https://github.com/vitillo/python_cache_flusher/tree/master/cache_flusher/data) to be build without needing a redistributable
2) install the msvcr110.dll on the system.

We should look at both options.  I think 1 would be most logical, but what if there is yet another file missing in the future?
Flags: needinfo?(rvitillo)
Flags: needinfo?(q)
for reference, here are some results:
https://pastebin.mozilla.org/5545400
Joel,

Thanks for testing it. I don't think we are going to require any other file in the future. I compiled that dll statically, could you please double check that it works "out of the box" for you now? You can upgrade cache_flusher through pip, once I have your OK I will ask to upload that package to our repository.

The numbers look stable enough to me but without a non-cold comparison on the same machine it's hard to say.
Flags: needinfo?(rvitillo) → needinfo?(jmaher)
I tested the new cache-flusher bits from pypi and this worked on my loaner.  Knowning this, we will get these on the internal pypi server, then we can test on try- stay tuned :)
Flags: needinfo?(q)
Flags: needinfo?(jmaher)
cache flushing is working on try:
https://tbpl.mozilla.org/php/getParsedLog.php?id=43832469&tree=Try&full=1

I have requested some retriggers, lets see how stable this is and get a better idea of time it takes to run.

right now it is taking ~10 minutes to run this test.

Once we get more results we can discuss next steps.

Thanks :rvitillo and :Q for your help on this so far.
Mentor: rvitillo
Duplicate of this bug: 522807
The current state is that the noise on the Talos box is just too high to get anything meaningful out of it. Running ts_paint_cold on my local machine (or Joel's for that matter) seems to exhibit though a much stabler behavior. That might be due to the fact that I am using an SSD.
my local computer is SSD only as well.  Maybe this is not something to embark on as part of talos?
Unfortunately the noise can't be reduced with defragmentation or by mounting an ISO image so this seems to be as good as it gets. Also note that talos machines are defragmented weekly.
The question becomes, do we move forward with this, or do we forget about this?
I suggest to run this test for a longer period of time, say a month, and see if we can learn anything from it.
Assignee: rvitillo → nobody
Whiteboard: [good first bug][lang=python]
we did some up front work on this, I envision this running in the xperf job since this needs to run on windows 7 only.  I have tested this on try for about 50 retriggers and it is green.
Assignee: nobody → jmaher
Attachment #8430839 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #8502022 - Flags: review?(wlachance)
Comment on attachment 8502022 [details] [diff] [review]
finish enabling ts_paint_cold for talos (1.0)

wlach is in a work week and will be travelling tomorrow, lets get a review from dminor.wlachance
Attachment #8502022 - Flags: review?(wlachance) → review?(dminor)
after some feedback from :avih, I have reshuffled the parameters around so we can actually take advantage of inheritance and define the right things.
Attachment #8502022 - Attachment is obsolete: true
Attachment #8502022 - Flags: review?(dminor)
Attachment #8503273 - Flags: review?(dminor)
Comment on attachment 8503273 [details] [diff] [review]
finish enabling ts_paint_cold for talos (1.1)

Review of attachment 8503273 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #8503273 - Flags: review?(dminor) → review+
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Hey Joel, is ts_paint_cold currently running in the tree anywhere?
Flags: needinfo?(jmaher)
no, we removed that quite some time ago.  I recall the reasons being that it wasn't accurately measuring cold.  I believe we did some experiments to flush the cache and (as you can see in the comments here).  As to why those didn't work, I couldn't find it right away.  I recall something about the results not being useful.
Flags: needinfo?(jmaher)
(In reply to Joel Maher (:jmaher) from comment #57)
> no, we removed that quite some time ago.  I recall the reasons being that it
> wasn't accurately measuring cold.  I believe we did some experiments to
> flush the cache and (as you can see in the comments here).  As to why those
> didn't work, I couldn't find it right away.  I recall something about the
> results not being useful.

The variance of the data was just too high to detect regressions.
If we want to use this test for something now, would it be appropriate to just reopen this bug? Or should I file a follow up? Is anything in tree?
we have removed everything from the tree as this has been inactive for 6+ months.  We can consider a new test- it would be a low priority for me to work on.  Just mark the bug as a dependency to bug 1040081
(In reply to Jim Mathies [:jimm] from comment #59)
> If we want to use this test for something now, would it be appropriate to
> just reopen this bug? Or should I file a follow up? Is anything in tree?

My suggestion in the e10s perf criteria document was to use this harness (or manual testing if need be) to compare e10s cold-startup time *on an SSD*. The variance was excessive on Talos HDDs, but it was reasonable on laptop SSDs.
You need to log in before you can comment on or make changes to this bug.