Closed Bug 696053 Opened 10 years ago Closed 9 years ago

Make Ts Shutdown measure more realistic (and possibly less noisy) values

Categories: Testing :: Talos, defect
Status: RESOLVED WONTFIX
People: Reporter: mak, Assigned: jmaher

Right now Ts_Shutdown uses a SetTimeout(0) call to initiate shutdown, which has a couple of disadvantages:
1. Noise: SetTimeout(0) is inherently noisy, since the thread scheduler may produce any kind of ordering here.
2. It's unlikely users evaluate our performance by closing the browser at time 0 from startup.
3. All the delayed initialization has a shutdown cost, but we don't measure it.

The idea is to change SetTimeout(0) to a more meaningful value. The browser's delayedStartup initializes the last long-delayed task 10 seconds after the browser-delayed-startup-finished notification, so we may use SetTimeout(15000) or SetTimeout(20000) from that notification to stay on the safe side.

What we should figure out is:
1. Is the result less noisy?
2. Does the shutdown time increase by a meaningful amount? Ideally we would now measure something that the user really hits when closing the browser, including the time needed to shut down all the delayedStartup stuff.

If these end up being true, it would be a nice win.

The drawback is a new Tshutdown baseline across all the Talos results.
(In reply to Marco Bonardo [:mak] from comment #0)
> The drawback is a new Tshutdown baseline across all the Talos results.
The other drawback is that you cause a longer total test turnaround time (time from push to "got my results"). That's not a deal-breaker, and I'm all for measuring more real-world numbers, but be aware that this is also something to think about when making tradeoffs.

We are keeping up with load on all branches right now, but there is precious little breathing room in that metric.  Machines are on their way, but they are a long way off.  So, keep that in the back of your mind when weighing new Talos tests, especially ones that need to run two tests concurrently in order to re-baseline the numbers.
(In reply to Clint Talbert ( :ctalbert ) from comment #1)
> The other drawback is that you cause a longer total test turnaround time
> (time from push to "got my results").

yes, by 15-20 seconds; overall that is likely not a big deal for anyone.

> So, keep that in the back of your mind when weighing
> new Talos tests, especially ones that need to run two tests concurrently in
> order to re-baseline the numbers.

Well, in this case the resources don't change; they are instead better used to report something meaningful rather than wasted reporting something meaningless.

The alternative is to completely kill this test. I just made a patch that removed ALL the Places work from the Ts shutdown path and the test regressed, which means this test is completely pointless as it is.
(In reply to Marco Bonardo [:mak] from comment #2)
> yes, by 15-20 seconds; overall that is likely not a big deal for anyone.

And actually, considering the speed of the Ts shutdown tests, it's likely nobody will ever notice, since most of the other tests take far longer.
Currently we do 20 cycles for a ts test, so we are talking about adding 6:40 (20 cycles x 20 seconds) to the total runtime of ts proper.  This is multiplied by: ts, ts_paint, ts_places_generated_min, ts_places_generated_med, ts_places_generated_max, remote_ts.

So just changing this without considering the cost to our infrastructure will cost 40 minutes of machine time per push.

So do we need to do 20 cycles for ts, or is 10 enough?  Do we need to do this on all ts tests?  Do we need ts and ts_paint running side by side?  Do we need this on mobile, and is 20 seconds enough?  Obviously there are options to reduce the impact this will have; we just need to balance it.


The change is easy and will only affect ts shutdown, but I would like to get a better idea of how to roll this out.
(In reply to Joel Maher (:jmaher) from comment #4)
> So do we need to do 20 cycles for ts, or is 10 enough?

You may test how stable the results are with 10 or 15 cycles. I don't remember how many max and min results we drop, just one per side or two? In the former case 10 may be fine; in the latter case 15 would likely be better.

> Do we need to do this on all ts tests?

Where we don't do this we have a useless test that people will likely ignore, so I'd say yes; otherwise we waste resources.

> Do we need ts and ts_paint running side by side?

What does this mean? Do you mean using Ts to also measure Ts_paint? It may save resources.

> Do we need this on mobile

Mobile has a different startup path and a different browser.js; it should likely be handled apart, in a separate bug.

> is 20 seconds enough?

15 seconds may already be enough, since the last delayed startup is at 10 seconds, so 20 seconds is likely more than enough. It really depends on noise, which can likely be measured by empirical testing on a real box.

On the other hand, I'm starting to think we may evaluate killing some of the dirty-profile tests: all the regressions I saw there were also reported by other tests, the databases used are far from users' average databases, the numbers are not so far from the basic Ts, and Places practically does minimal work on startup and shutdown nowadays. We may drop Ts dirty min as a first step, since it shows the exact same behavior and numbers as basic Ts.
So, a compromise may be:
- delay by 15 seconds (measure whether the noise is acceptable)
- do 15 cycles instead of 20, or even 10 (measure whether the noise is acceptable)
- drop ts_places_generated_min
Leaving ts_remote alone for now, since mobile is different. That should be about 15 minutes, minus the current ts_places_generated_min time (let's say 1 minute), so 10-14 minutes added per push (rough arithmetic sketched below).
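A rough sketch of that arithmetic (assumptions: the 15 second delay applies to ts, ts_paint, ts_places_generated_med and ts_places_generated_max; ts_remote is left alone; dropping ts_places_generated_min saves roughly one minute per push):

# Rough sketch of the compromise arithmetic above; the suite list and the
# one-minute saving are assumptions taken from this comment, not measured.
DELAY_SECONDS = 15
SUITES = ["ts", "ts_paint", "ts_places_generated_med", "ts_places_generated_max"]
SAVED_MINUTES = 1  # rough estimate for dropping ts_places_generated_min

for cycles in (15, 10):
    added_minutes = len(SUITES) * cycles * DELAY_SECONDS / 60.0
    print("%d cycles: ~%d minutes added, ~%d net per push"
          % (cycles, added_minutes, added_minutes - SAVED_MINUTES))
# 15 cycles: ~15 minutes added, ~14 net per push
# 10 cycles: ~10 minutes added, ~9 net per push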
We are running ts and ts_paint right now.  Some developers do not want to turn this off after months of running these side by side.

Right now ts_remote runs the same code as ts, so we are making this more complex.

So we have tests defined for:
{ts_cold|ts_places}_{min|med|max}

We don't seem to be running the cold tests anywhere, but the talos dirty tests include all the ts_places_* tests:
    'dirty': {
        'enable_by_default': True,
        'suites': GRAPH_CONFIG + ['--activeTests', 'ts_places_generated_min:ts_places_generated_med:ts_places_generated_max'],
        'options': (TALOS_DIRTY_OPTS, ALL_PLATFORMS),
    },

I don't know who is required to make these decisions, but we can look into removing these dirty tests if they are not useful.  Is generated_min not useful?  


To address the 10/15 cycle question, I don't know where we calculate the numbers.  It appears as if we are sending the whole data set to the graph server and the graph server must do some magic.
This bug was filed for disabling ts vs tpaint: https://bugzilla.mozilla.org/show_bug.cgi?id=660124. It is a dup of another bug, but we could push on that to clean up a few things.
(In reply to Joel Maher (:jmaher) from comment #7)
> I don't know who is required to make these decisions, but we can look into
> removing these dirty tests if they are not useful.  Is generated_min not
> useful?

While med and max have sometimes been useful, I can't really think of a situation where min has been; Ts is already a good measure for min. Dietrich can likely approve this change, as module owner. They were really useful some time ago, when we started working on specific perf issues.
Some time ago I filed a bug to improve med and max using Places telemetry data; as they are now they are likely not so useful, but still more useful than min.
Once fixed with real telemetry data they may be somewhat useful.

> To address the 10/15 cycle question, I don't know where we calculate the
> numbers.  It appears as if we are sending the whole data set to the graph
> server and the graph server must do some magic.

IIRC there is internal correction by dropping the highest and lowest results; standalone talos does the same. I just don't remember how many are dropped.
Ah, cold tests were disabled quite some time ago, when we discovered they were measuring entropy in the universe.
In standalone talos we output to the screen or csv files.  Here is the code that does that:
http://hg.mozilla.org/build/talos/file/f9abfefec8aa/run_tests.py#l125

That takes an average and excludes the maximum value.  Personally I think it should remove the max and the min and take the mean of the rest (an 80% mean, assuming we did 10 runs).

I guess we can assume the graph server does a similar calculation (average_minus_max).
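For reference, a minimal sketch of the two filters being discussed; the function names are illustrative, not the actual run_tests.py or graph server implementations:

def average_minus_max(values):
    # Drop the single highest value, then average the rest.
    kept = sorted(values)[:-1]
    return sum(kept) / float(len(kept))

def trimmed_mean(values):
    # Drop the highest and the lowest value, then average the rest
    # (an "80% mean" when there are 10 runs).
    kept = sorted(values)[1:-1]
    return sum(kept) / float(len(kept))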
(In reply to Marco Bonardo [:mak] from comment #5)
> (In reply to Joel Maher (:jmaher) from comment #4)
> 
> > Do we need this on mobile
> 
> Mobile has a different startup path and a different browser.js; it should
> likely be handled apart, in a separate bug.
> 

Filed https://bugzilla.mozilla.org/show_bug.cgi?id=700326 to track matching work for mobile.
The first set of numbers is the default of what we run now, and the second set is with a 20 second delay before shutting down the browser.  It seems to raise the tsshutdown numbers by about 2.5%.  Not a big difference, but it is more.

(talos)jmaher@jmaher-ubuntu-mbp:~/mozilla/talos/talos$ cat output/ts_paint_shutdown.csv 
i,value
0,756
1,710
2,698
3,695
4,656
5,685
6,768
7,655
8,616
9,691
RETURN: ts_paint_shutdown: 684.67
(talos)jmaher@jmaher-ubuntu-mbp:~/mozilla/talos/talos$ cat output.20/ts_paint_shutdown.csv 
i,value
0,827
1,654
2,689
3,601
4,783
5,683
6,789
7,698
8,694
9,729
RETURN: ts_paint_shutdown: 702.22
(talos)jmaher@jmaher-ubuntu-mbp:~/mozilla/talos/talos$ 
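A small sketch that reproduces the two RETURN values above from the raw CSV rows, assuming the harness applies an average-minus-max filter:

# Reproduce the RETURN values above from the raw CSV rows, assuming the
# harness drops the single highest value and averages the rest.
def average_minus_max(values):
    kept = sorted(values)[:-1]
    return sum(kept) / float(len(kept))

default_run = [756, 710, 698, 695, 656, 685, 768, 655, 616, 691]
delayed_run = [827, 654, 689, 601, 783, 683, 789, 698, 694, 729]

old = average_minus_max(default_run)   # ~684.67
new = average_minus_max(delayed_run)   # ~702.22
print("%.2f -> %.2f (+%.1f%%)" % (old, new, 100.0 * (new - old) / old))
# prints: 684.67 -> 702.22 (+2.6%)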


Should we move forward with this?
Well, it happened at least 3 times recently that Ts_shutdown reported noise instead of real regressions, so the alternatives are:
- stay as we are and keep ignoring these results
- change and start caring about these results
The former is practically the same as removing the test.

The 2.5% difference is hard to interpret: by waiting we are no longer accounting for startup work in the shutdown path (so shutdown should be smaller), but we are now accounting for delayed shutdown work (so shutdown should be larger). So the new time is (old_time - async_startup_work + delayed_shutdown_work).
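Plugging in the two averages measured above (only the net difference is observable; the split between the two terms is unknown):

# new_time = old_time - async_startup_work + delayed_shutdown_work
# Only the net (delayed_shutdown_work - async_startup_work) is observable
# from the two runs above; the individual terms are unknown.
old_time = 684.67  # shutdown initiated at timeout 0
new_time = 702.22  # shutdown initiated after the 20 second delay
net = new_time - old_time
print("net = %.2f (~%.1f%% of the old time)" % (net, 100.0 * net / old_time))
# prints: net = 17.55 (~2.6% of the old time)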

If you run ts_paint_shutdown multiple times, is there a lot of noise? More or less than with the old approach?

PS: I thought we decided on a 15 second delay, not 20.
I can try to set up 20 or 30 runs this weekend before/after this change to see how solid the differences are.
The document looks protected.
Oh, sorry - it is fixed now.
So after the change the shutdown measure is smaller? This confirms some part of startup is being measured in the shutdown test, for example all the async initializations. The problem may be figuring out whether we care about measuring those, and how we plan to do that. For the Storage part we have STARTUP_ telemetry, at least.
Unfortunately it doesn't seem to help with noise; I was hoping for better results there :(

So, according to these numbers, if we change the measure, we get more realistic shutdown times, but we lose track of asynchronous startup work.
As it is now, we have some sort of measure of that async work, but it's in the shutdown test, and we are effectively measuring the shutdown time of a user who closes the process as soon as it's open, which is pretty much useless.
Both approaches look imperfect still :(
Yeah, I was surprised at the numbers, in the sense that they were much lower and that there was so much fluctuation after a 15 second sleep.

I am open to new suggestions.  It isn't too hard to try out some experiments, so don't hesitate to ask :)
Is there still anything left to do on this bug?  It has been over 6 months since we made some changes to the ts tests.
If increasing the timeout between startup and shutdown ended up measuring negatively (and comment 16 seems to point that way), we can probably wontfix this and evaluate different approaches in the future.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX