Closed Bug 468680 Opened 13 years ago Closed 13 years ago

unthrottle talos winxp, vista and ubuntu boxes.

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: anodelman, Assigned: anodelman)

Details

Now that testing on the firefox2 line has ended, we are no longer concerned with timing granularity (which was only ever an issue on winxp - and possibly vista).  We initially throttled to get better numbers on windows machines and were then forced to throttle more generally so that numbers could be compared between platforms.

If throttling is no longer required on winxp then we can discontinue throttling talos boxes across the board.
As mentioned on dev.planning, my crude testing of Mac 10.5 and Vista seems to back up the notion that Date.now()-based timing has millisecond granularity on trunk, at least.
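A crude granularity test of this kind can be sketched as follows. This is only an illustration in Python, not the test that was actually run against Date.now(); the names `smallest_tick` and `wall_clock_ms` are hypothetical. It reads a clock repeatedly and reports the smallest forward step it ever observes.

```python
import time
from typing import Callable


def smallest_tick(clock: Callable[[], float], samples: int = 200) -> float:
    """Return the smallest forward step seen between consecutive clock reads.

    `clock` is any callable returning a time in milliseconds. A coarse
    timer reports a large step; a fine-grained one reports 1 ms or less.
    """
    smallest = float("inf")
    seen = 0
    last = clock()
    while seen < samples:
        now = clock()
        if now > last:  # ignore repeated readings of the same value
            smallest = min(smallest, now - last)
            last = now
            seen += 1
    return smallest


def wall_clock_ms() -> float:
    """Wall-clock time in milliseconds (hypothetical stand-in for Date.now())."""
    return time.time() * 1000.0
```

Calling `smallest_tick(wall_clock_ms)` on a given machine gives a rough lower bound on how finely that clock can resolve time; it says nothing about the browser's own timer.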
We should probably start by unthrottling half of the boxes, to see what it does to both noise and real changes over a little while.
Before we do any large scale change overs we need:

1 - some comment assuring that the issues for which bug 393940 was required are, indeed, fixed - this is the first bug relating to throttling talos so it would be nice for a comment here from vlad.
2 - a general understanding that if we disable throttling then we will be unable to compare current performance results with historic values from ff2.  I don't think that this is much of a loss, but developers should be aware that backwards comparison will be out (especially if the js timer resolution bug is still present in ff2, meaning that we can't go back and re-test a selection of builds).
3 - testing on staging to ensure that we get results that are at least as consistent as those currently recorded
4 - roll out plan and scheduled downtime
This is one way to "make Talos faster". Splitting Talos test suites to run on different machines in parallel might be another way. Eliminating queuing time by having multiple slaves in a pool might be yet another way. 

We're still gathering requirements, and figuring out the consequences of questions in comment#3, so no work can start here yet. Moving to Future while we figure out if this is something we should do or not.
Component: Release Engineering: Talos → Release Engineering: Future
Starting the initial work here of turning off throttling on talos staging machines to see what happens to number quality.
Assignee: nobody → anodelman
Priority: -- → P2
Numbers look good on vista/ubuntu.  Still investigating winxp to determine if the wonky results are due to the staging box being in a weird state or to the de-throttling.
Alice suggested on IRC that we start by rolling this out for the 1.9 branch; the only real risk there is that before/after comparisons between builds there and builds in 1.9.1 won't be possible anymore.

I think I'll post to dev-planning to gather input on that plan.
This sounds great!  What did it do to our test run time?
Yep, definitely sounds great, as long as we can still get precise numbers.  Are there any datasets available now that we can use to compare to the throttled numbers?
I've unthrottled the staging talos boxes, so if you look on graphs-stage.mozilla.org for qm-pxp-stage01/qm-pubuntu-stage01/qm-pvista-stage01 you should see the fall off in numbers.  From my look over the results they look just as good as they did when throttled... just faster.

The staging talos boxes take a beating so while they do give us a good idea that nothing is going to break and the numbers will be fine, they don't reflect exactly the tests/setup in production.  I would think that we would be seeing test cycle times on par with leopard/tiger - as they are unthrottled.  A quick glance at the waterfall shows leopard machines completing all tests in about an hour, which would be about 25-30 minutes faster than a winxp run.  From this I would expect a 25-30 minute speed up across the newly unthrottled machines.
Planning to unthrottle during the downtime scheduled for the morning of Friday (23rd Jan). (This was also announced on dev.planning.)
Firefox3.0 unthrottled.  We lose roughly 10 minutes off the full test cycle time of winxp/vista/linux - I was really hoping for a bigger bang than that.  The slowest test is still Tp; on winxp it takes just over an hour to complete.

before (minutes/machine)
71 qm-mini-ubuntu01  
71 qm-mini-ubuntu02
71 qm-mini-ubuntu05 
68 qm-mini-ubuntu03 
after (minutes/machine)
59 qm-mini-ubuntu01 
59 qm-mini-ubuntu02 
60 qm-mini-ubuntu05 
58 qm-mini-ubuntu03 

before (minutes/machine)
90 qm-mini-xp01 
90 qm-mini-xp02 
90 qm-mini-xp03 
87 qm-mini-xp05 
after (minutes/machine)
81 qm-mini-xp01 
82 qm-mini-xp02 
82 qm-mini-xp03 
80 qm-mini-xp05 

before (minutes/machine)
92 qm-mini-vista01 
95 qm-mini-vista02 
94 qm-mini-vista03 
87 qm-mini-vista05 
after (minutes/machine)
84 qm-mini-vista01 
83 qm-mini-vista02 
85 qm-mini-vista03 
82 qm-mini-vista05
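
As a quick check on the tables above, the mean saving per platform can be computed directly from the listed cycle times. The snippet below is just arithmetic on those numbers; the platform keys and the `mean_saving` helper are names chosen here for illustration.

```python
from statistics import mean

# Full test cycle times in minutes, copied from the before/after tables.
BEFORE = {
    "ubuntu": [71, 71, 71, 68],
    "winxp": [90, 90, 90, 87],
    "vista": [92, 95, 94, 87],
}
AFTER = {
    "ubuntu": [59, 59, 60, 58],
    "winxp": [81, 82, 82, 80],
    "vista": [84, 83, 85, 82],
}


def mean_saving(platform: str) -> float:
    """Mean minutes saved per full test cycle on one platform."""
    return mean(BEFORE[platform]) - mean(AFTER[platform])


for platform in BEFORE:
    print(f"{platform}: {mean_saving(platform):.2f} minutes saved per cycle")
```

This works out to 11.25 minutes on ubuntu, 8.00 on winxp and 8.50 on vista, consistent with the "about 10 minutes" summary.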
I think that the benefits of unthrottling are enough that, even though we are only gaining 10 minutes per test cycle, we should roll it out across all talos boxes.

You can see the results here:

https://wiki.mozilla.org/Buildbot/Talos/Machines#1.8_.26_1.9_.28Firefox3.0_.26_Mozilla1.8.29

The winxp numbers appear to have less variance post-unthrottling, the linux numbers seem to have slightly greater variance post-unthrottling.  Vista is unchanged.
We'll gain an extra couple of minutes in terms of cycle time on ubuntu Talos boxes by using the 'performance' cpu governor (I've been going to the default 'ondemand' governor when unthrottling boxes).  I've tested on stage and this does not degrade the quality of performance results.

I'll include the change in the downtime for rolling out unthrottling across all Talos boxes.
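
For anyone wanting to confirm which governor a box is using, here is a minimal sketch. It assumes the standard Linux cpufreq sysfs layout (`/sys/devices/system/cpu/cpuN/cpufreq/scaling_governor`); this bug does not record how the governor was actually set on the Talos machines, and `read_governors` and `all_use` are hypothetical helper names.

```python
from pathlib import Path
from typing import Dict

# Assumed location of the Linux cpufreq sysfs interface.
CPU_ROOT = Path("/sys/devices/system/cpu")


def read_governors(cpu_root: Path = CPU_ROOT) -> Dict[str, str]:
    """Map each CPU directory name (e.g. 'cpu0') to its current governor."""
    governors = {}
    for path in sorted(cpu_root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        # path is <root>/cpuN/cpufreq/scaling_governor, so go up two levels.
        governors[path.parent.parent.name] = path.read_text().strip()
    return governors


def all_use(governor: str, cpu_root: Path = CPU_ROOT) -> bool:
    """True if at least one CPU was found and every CPU uses `governor`."""
    found = read_governors(cpu_root)
    return bool(found) and all(value == governor for value in found.values())
```

On a tester, `all_use("performance")` would be expected to return True once the change is in place; a mix of 'ondemand' and 'performance' would return False.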
All Talos boxes now unthrottled as of downtime this morning.

Numbers look good.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
The change from 'ondemand' to 'performance' on the talos ubuntu testers appears to have resolved the increased variance observed in comment 13.  This means that we are getting as consistent numbers as we were before unthrottling across all platforms.
Moving closed Future bugs into Release Engineering in preparation for removing the Future component.
Component: Release Engineering: Future → Release Engineering
Product: mozilla.org → Release Engineering