Closed Bug 463020 Opened 13 years ago Closed 13 years ago

Talos machines should be automatically rebooted periodically

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: catlee)

Attachments

(4 files, 3 obsolete files)

After Talos machines have been up for a while, the performance results start to drift.  Rebooting the machines seems to fix this problem.

We should be rebooting the Talos machines regularly so that this drift is not a problem.  Two approaches are:

- Reboot after every n Talos runs

- Reboot if uptime > m

Other issues to solve:

- How to perform the reboot?  The cltbld account normally doesn't have permission to do this.  On Linux and Mac we could give sudo access.

- How to prevent buildbot from freaking out?  If the reboot is performed inside a buildbot job, then that job will fail, and then the build slave will lose connectivity.
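The two triggers listed above can be combined in one small policy function. This is only a sketch: the function name, its parameters, and the idea of passing in a run count and an uptime value are assumptions for illustration, not anything that exists in Talos or Buildbot.

```python
def should_reboot(runs_since_boot, uptime_seconds,
                  max_runs=None, max_uptime_seconds=None):
    """Return True if either configured reboot trigger has fired.

    Hypothetical helper: max_runs corresponds to "n" and
    max_uptime_seconds to "m" in the description above.
    A trigger set to None is disabled.
    """
    # Trigger 1: reboot after every n Talos runs.
    if max_runs is not None and runs_since_boot >= max_runs:
        return True
    # Trigger 2: reboot if uptime > m.
    if max_uptime_seconds is not None and uptime_seconds > max_uptime_seconds:
        return True
    return False
```

Keeping the decision separate from the act of rebooting means the policy can be tuned (or tested) without touching a real machine.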
(I've been thinking about this a lot lately, so bear with me)

I think the "right" way to do this is from inside of Buildbot. It's the only way we can be *certain* we don't interrupt a running job.

Rebooting based on uptime is going to be tough. AFAIK we don't have any Unix facilities (cygwin, msys, et al.) on the Windows Talos machines, so we can't use a simple bash one-liner to do it.

Rebooting after every N runs is pretty easy. It'll require a custom BuildStep, but a trivial one:

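# Import paths as found in Buildbot 0.7.x
from buildbot.steps.shell import ShellCommand
from buildbot.status.builder import SKIPPED

class Reboot(ShellCommand):
    def start(self):
        buildNum = self.step_status.getBuild().getNumber()
        # reboot every 25 builds; skip this step on all other builds
        if (buildNum % 25) != 0:
            return SKIPPED
        ShellCommand.start(self)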

...and the TalosFactory would call it thusly:
self.addStep(Reboot, command=['shutdown', '-r', '-t', '0'], flunkOnFailure=False)

(with the proper variant per platform.)

I'm probably missing a detail or two in the step, but I've used the same logic for conditional clobbers on a non-Mozilla Buildbot. The flunkOnFailure is important here -- it will make sure that the build won't turn red when the slave *does* reboot.

One thing I'm not sure of here is whether the master will try to give the ghosted slave another job before it rejoins.

Just my thoughts, take them fwiw!
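The step above keys off the Buildbot build number. A different way to count runs, sketched here under the assumption that the slave can keep a small counter file on disk, is to bump the counter at the end of each run and report when a reboot is due. The function name and the counter-file layout are hypothetical, and this is not the content of the count_and_reboot.py attachment on this bug.

```python
def bump_and_check(counter_path, max_count):
    """Increment the run counter in counter_path.

    Returns True (and resets the counter to 0) once max_count runs
    have been recorded, False otherwise. Hypothetical helper.
    """
    try:
        with open(counter_path) as f:
            count = int(f.read().strip() or 0)
    except (IOError, ValueError):
        # Missing or corrupt counter file: start counting from zero.
        count = 0
    count += 1
    due = count >= max_count
    with open(counter_path, "w") as f:
        f.write(str(0 if due else count))
    return due
```

Because the counter lives on the slave, it survives the reboot itself and does not depend on how the master numbers builds.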
Attached patch add call to count_and_reboot.py (obsolete) — Splinter Review
Attachment #346470 - Attachment mime type: application/octet-stream → text/plain
Duplicate of this bug: 379234
Priority: -- → P2
Looking at http://qm-buildbot01.mozilla.org:2008, it seems that the slaves

qm-pxp-stage01
qm-pvista-stage01
qm-pubuntu-stage01
qm-ptiger-stage01
qm-pleopard-stage01

are for staging only, and could be used to see if this auto-restarting patch
works.
http://graphs-stage.mozilla.org/#show=395345,395333,395459,395497,395511

Machines have been rebooting after every 3 Talos runs since Nov 12th, ~12pm.
Attached patch add call to count_and_reboot.py (obsolete) — Splinter Review
Attachment #346468 - Attachment is obsolete: true
Attachment #348603 - Flags: review?(anodelman)
Once these are approved, let's make these live on Firefox 3.0 production.

These machines are:
qm-plinux-fast01 (fast)
qm-mini-ubuntu03 (nochrome)
qm-mini-ubuntu01
qm-mini-ubuntu02
qm-mini-ubuntu05

qm-pmac-fast01 (fast)
qm-pmac05 (nochrome)
qm-pmac01
qm-pmac02
qm-pmac03

qm-pleopard-trunk07
qm-pleopard-trunk08

qm-pxp-fast01 (fast)
qm-pxp-jss01 (jss)
qm-pxp-jss02 (jss)
qm-pxp-jss03 (jss)

qm-mini-xp05 (nochrome)
qm-mini-xp01
qm-mini-xp02
qm-mini-xp03

qm-mini-vista05 (nochrome)
qm-mini-vista01
qm-mini-vista02
qm-mini-vista03
qm-pxp-
ignore that last 'qm-pxp-'

on the mac and linux machines, we need to add this line to /etc/sudoers:

mozqa	ALL=NOPASSWD: /sbin/reboot
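With that sudoers entry in place, the reboot command differs per platform. A minimal sketch of picking the command, assuming the Windows boxes use the `shutdown -r -t 0` form mentioned earlier in this bug and the Linux/Mac boxes use `sudo /sbin/reboot`; the function name is hypothetical:

```python
import sys


def reboot_command(platform=sys.platform):
    """Return the argv list used to reboot on the given platform.

    Hypothetical helper; `platform` takes sys.platform-style strings
    such as "win32", "darwin" or "linux2".
    """
    if platform.startswith("win"):
        # Windows: immediate restart via the built-in shutdown tool.
        return ["shutdown", "-r", "-t", "0"]
    # Linux and Mac: relies on the NOPASSWD sudoers entry for /sbin/reboot.
    return ["sudo", "/sbin/reboot"]
```

The returned list can be handed to a ShellCommand step or to `subprocess.call` without going through a shell.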
You're missing qm-pleopard-trunk06 from your list.
Attachment #348604 - Flags: review?(anodelman) → review+
Comment on attachment 348603 [details] [diff] [review]
add call to count_and_reboot.py

The code is fine - the downside is that this will affect all production boxes.  We need an interim patch that will only end up touching the production Firefox3.0 machines so that we can do a proper test before a full roll out.
Attachment #348603 - Flags: review?(anodelman) → review-
qm-pubuntu-stage01 and qm-pvista-stage01 both failed to come back after an automated reboot.  We need to figure out why before moving forward with this.
I think that we can still move forward with a patch limited to affecting Firefox3.0 so that we can get a real feel for how often (or if) machines don't come back from reboot.

This will require a new patch and a plan for applying the necessary sudoers change to the linux/mac machines.  I think that we can fold this into the already scheduled downtime for next monday.
Attachment #348603 - Attachment is obsolete: true
Attachment #350184 - Flags: review?(anodelman)
Comment on attachment 350184 [details] [diff] [review]
add call to count_and_reboot.py

This is a good way around this - though, we could also go with passing in an extra variable for turning rebooting on/off in case we want to do this per-factory.  But, not necessary for this in between test phase.
Attachment #350184 - Flags: review?(anodelman) → review+
I've updated all the mac and linux machines except for qm-pleopard-trunk06,08 which are down right now (see bug #466889)
Depends on: 466889
qm-pleopard-trunk06 and qm-pleopard-trunk08 now have sudoers updated
Attachment #348604 - Flags: checked‑in+
Attachment #350184 - Flags: checked‑in+
This has been put into production for FF3.0 machines
Duplicate of this bug: 419620
Update:

- in production for Firefox3.0
- 3 winxp machines fell over (bug 467608) - no diagnosis yet
- 2 mac machines fell over (bug 467568) - appears to be caused by a fixable configuration issue

This will continue to bake on Firefox3.0 till we've worked out the snags.
Duplicate of this bug: 467791
Has anyone investigated whether the drift is due to the Windows fastload files? It may not be but it might be worth a shot.
Duplicate of this bug: 467797
Duplicate of this bug: 467796
(In reply to comment #23)
> Has anyone investigated whether the drift is due to the Windows fastload files?
> It may not be but it might be worth a shot.

I would hope that we're starting with a new profile every time.


I'm somewhat disappointed that any bug about machine inconsistency is getting duped to this bug; we really ought to have stable numbers, not slightly increasing and then going down again when we reboot.  If rebooting is really necessary to get stable numbers, then it seems like we should be doing it *every* run.
(In reply to comment #26)
> (In reply to comment #23)
> > Has anyone investigated whether the drift is due to the Windows fastload files?
> > It may not be but it might be worth a shot.
> 
> I would hope that we're starting with a new profile every time.
> 
> 
> I'm somewhat disappointed that any bug about machine inconsistency is getting
> duped to this bug; we really ought to have stable numbers, not slightly
> increasing and then going down again when we reboot.  If rebooting is really
> necessary to get stable numbers, then it seems like we should be doing it
> *every* run.

Results being impacted by machine uptime shouldn't be surprising.  We're very dependent on all sorts of O/S issues like file system caches and memory fragmentation that we have little control over.

The frequency of reboots will take some time to work out, I imagine.  It could be that the first run post-reboot is always faster or slower, and then results flatten out.  It could be that running a day's worth of tests before rebooting doesn't show any significant drift.  We need more data before saying we should reboot before *every* run.
Duplicate of this bug: 467791
If the first run after a reboot is different, then we need to throw the data from it out.  If we push that data to the graph, people will waste time chasing down a regression that doesn't exist.  If the first six are different and then it flattens out, then we need to throw the first six out.  If we reboot every time, at least we're running the same test each time.
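Throwing out the first few post-reboot runs could look roughly like the sketch below. It assumes each result is tagged with how many runs have completed since the last reboot (0 for the first run after a reboot); that tagging, the function name, and the sample numbers are made up for illustration.

```python
def discard_warmup(results, warmup_runs):
    """Drop results recorded during the first warmup_runs runs after a reboot.

    results: list of (runs_since_reboot, value) pairs, where
    runs_since_reboot is 0 for the first run after a reboot.
    Hypothetical helper, not part of Talos.
    """
    return [value
            for runs_since_reboot, value in results
            if runs_since_reboot >= warmup_runs]


# Illustrative data only: two reboot cycles, first run of each is an outlier.
samples = [(0, 310.0), (1, 300.0), (2, 301.0), (0, 312.0), (1, 299.0)]
kept = discard_warmup(samples, 1)
```

With `warmup_runs` set to 6 the same function covers the "first six are different" case; rebooting before every run corresponds to every sample having index 0.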
Right, but what we don't know is if a machine is sufficiently 'settled' right after a reboot to give meaningful results.  We've only been rebooting regularly on FF3.0 for 2 days now, so give it some time!
(In reply to comment #27)
> Results being impacted by machine uptime shouldn't be surprising.  We're very
> dependent on all sorts of O/S issues like file system caches and memory
> fragmentation that we have little control over.
Where is the data that shows that this randomness is caused by those OS issues?  Until we have that data, we should assume it's something in our code and try to fix it.  In the specific case of bug 467791, we had a conversation with a developer yesterday where he had an idea on what could be causing that issue.
(In reply to comment #26)
> (In reply to comment #23)
> > Has anyone investigated whether the drift is due to the Windows fastload files?
> > It may not be but it might be worth a shot.
> 
> I would hope that we're starting with a new profile every time.
These aren't profile files... iirc the files I am referring to are prefetch files and are located at C:\Windows\Prefetch. For Firefox they would be named FIREFOX.EXE-XXXXXX.pf where XXXXXX I believe is a random hex number.
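To check whether such files exist on a given box, something along these lines would list them. It is a sketch only: the function name is hypothetical, and the default directory and the FIREFOX.EXE-*.pf pattern are taken from the description above rather than verified on a Talos machine.

```python
import glob
import os


def firefox_prefetch_files(prefetch_dir=r"C:\Windows\Prefetch"):
    """Return the sorted paths of Firefox prefetch files in prefetch_dir.

    Hypothetical helper; matches names of the form FIREFOX.EXE-XXXXXX.pf.
    """
    pattern = os.path.join(prefetch_dir, "FIREFOX.EXE-*.pf")
    return sorted(glob.glob(pattern))
```

Comparing the list (and the files' modification times) before and after a run would show whether Windows is regenerating them between builds.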
The goal of this bug is to generate consistent performance results.  It's good testing practice to work from a clean environment - at the high end I would love if we could re-image fresh before each test run.

From my work with talos I've seen the following
- vista numbers for all tests gradually rise along with machine uptime, resolved by rebooting
- leopard boxes show wildly increasing tp times until the box eventually freezes and needs a hard reboot
- leopard/tiger boxes have runaway Terminal apps consuming 100% cpu resulting in series of rapidly failing tests until the box is rebooted
- ubuntu boxes showing fluctuating results, or results consistently too high/too low, resolved by rebooting

Rebooting fixes a lot of problems and gives us far fewer gaps in our knowledge where machines were cycling green but reporting garbage results.  We'll play with the system a bit and if it turns out that rebooting after every single test run is beneficial we'll do it.

As to comment #31, if we are interested in learning about how firefox behaves after a long period of uptime then we should specifically design a test and a test harness to do that.  We shouldn't be relying on a side effect of the original talos design to be generating useful data for us.
Duplicate of this bug: 467791
- qm-pmac-fast01 went into a fail state post reboot again, seems to be having trouble restarting apache post-reboot
Looking at the graphs from here: 

https://wiki.mozilla.org/Buildbot/Talos/Machines#1.8_.26_1.9_.28Firefox3.0_.26_Mozilla1.8.29

post auto-rebooting:
- vista numbers from different machines are more consistent with each other
- ubuntu numbers that were on a gradual upward swing were corrected and are now consistent 
- winxp results dropped but look like they may be normalizing, will have to keep watching this
- leopard results haven't done a major swing (ie, their periodic major increases in Tp numbers followed by system freeze/crash hasn't happened), but further watching here is warranted 

From my point of view we need to figure out how to ensure that mac boxes come back up and successfully restart apache before starting testing.  Other than that, I'd like to track the numbers for another week to ensure that they remain consistent.  If that all looks good then we should push this out to 1.9.1 and 1.9.2.
(In reply to bug #467791 comment #6)
> Please read https://bugzilla.mozilla.org/show_bug.cgi?id=463020#c33
> 
> Relying on a system state that is dirty as a side effect of other tests isn't a
> good way to test how firefox behaves over the long term.
> 
> Please direct further discussion to 463020.
> 
> *** This bug has been marked as a duplicate of bug 463020 ***
Regarding the prefetch files: they optimize loading of an application that is expected not to change as often as the application on talos does. The exact effect this has on prefetch (iirc this is called superfetch on Vista, due to the changes made to prefetch there) is unknown and *might* be adverse. It is also important to note that talos is not entirely representative of how firefox behaves over the long term, since a user wouldn't be running a new build, at least potentially, as often as talos does.
Still having issues with apache not starting on talos tiger boxes post-reboot.  Installed this crontab on the affected machines:

@reboot ps -axc | grep httpd || SystemStarter start "Web Server"
*/5 * * * * ps -axc | grep httpd || SystemStarter start "Web Server"

Will give it another day to see if we stop having frozen boxes waiting on web page loads.
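The crontab restarts apache if it is missing, but the test run itself could also refuse to start until the web server answers. A minimal sketch of such a pre-flight check, assuming the pages are served from a local TCP port; the function name, host, port, and timeout values are assumptions:

```python
import socket
import time


def wait_for_port(host, port, timeout_seconds, poll_interval=1.0):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True once a connection is accepted, or False if
    timeout_seconds elapse first. Hypothetical helper.
    """
    deadline = time.time() + timeout_seconds
    while True:
        try:
            sock = socket.create_connection((host, port), timeout=poll_interval)
            sock.close()
            return True
        except OSError:
            # Connection refused or timed out: retry until the deadline.
            if time.time() >= deadline:
                return False
            time.sleep(poll_interval)
```

A call such as `wait_for_port("localhost", 80, 300)` before the first page load would turn a frozen box into a quick, visible failure.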
Another issue to deal with: screen dimensions on the winxp boxes seem to occasionally return to 800x600 (instead of 1280x1024) upon reboot, resulting in lower than expected numbers being reported.  We'll need a way to ensure that the screen dimensions are correct before beginning testing.
Found a small command line tool for windows that allows us to force screen dimensions upon reboot.  Have installed on all the firefox3.0 winxp machines.
Vista TS, TP3, and TSVG have all significantly increased after the reboot. There was almost a 100% increase in TS when comparing 1.9.2 to 1.9.1 prior to the reboot and now they are approximately the same which leads me to believe that the reboot caused this increase. For the time being I am keeping the tree closed to investigate further.
Also noticed that Vista's TS, TP3, and TSVG on 1.9.2 are also much closer to XP on 1.9.1 and 1.9.2 which also leads me to believe that this is due to the reboot.
Seems to have caused bug 467990.
Regarding the maintenance on the talos machines today:  it would be much better if talos machine maintenance were done during a tree closure such that the cycle before the maintenance and the cycle after it are testing the same code.  This allows performance regressions from the maintenance to be separated from those due to the code (assuming the regressions are large enough to be detected in one cycle, at least, which isn't always the case).

Today there was a closure, but (I was told) the talos runs that would have provided the necessary coverage were stopped in the middle of the runs in some cases.  The talos runs should have been allowed to test the post-closure code before they were stopped.

What happened today can cast unnecessary suspicion on the code that landed before the closure, potentially requiring its authors to go through cycles of backout and relanding.
(In reply to comment #44)
> Regarding the maintenance on the talos machines today:  it would be much better
> if talos machine maintenance were done during a tree closure such that the
> cycle before the maintenance and the cycle after it are testing the same code. 
> This allows performance regressions from the maintenance to be separated from
> those due to the code (assuming the regressions are large enough to be detected
> in one cycle, at least, which isn't always the case).
> 
> Today there was a closure, but (I was told) the talos runs that would have
> provided the necessary coverage were stopped in the middle of the runs in some
> cases.  The talos runs should have been allowed to test the post-closure code
> before they were stopped.
> 
> What happened today can cast unnecessary suspicion on the code that landed
> before the closure, potentially requiring its authors to go through cycles of
> backout and relanding.

This was my fault, my apologies.  I wanted to minimize the tree closure period, so interrupted the currently running tests.
Depends on: 468121
Apache still not consistently starting on reboot.  Going to turn off auto-reboot until this can be fixed.
Catlee - can you put a patch together for rebooting after every test run?  The winxp talos boxes are showing more peaks/valleys now that auto-rebooting is on - I'd like the line to be as smooth as possible before moving ahead with this for other machines.
Attachment #352102 - Flags: review?(anodelman)
Attachment #352102 - Flags: review?(anodelman) → review+
Comment on attachment 352102 [details] [diff] [review]
[Checked in]reboot after every test run

Checking in perfrunner.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perfmaster/perfrunner.py,v  <--  perfrunner.py
new revision: 1.28; previous revision: 1.27
done
Attachment #352102 - Attachment description: reboot after every test run → [Checked in]reboot after every test run
Attachment #352102 - Flags: checked‑in+
qm-plinux-fast01 fell over around 5am this morning (bug 468827)
Now that we are rebooting boxes post every test run the winxp talos Tp results have gotten as steady as they were pre-autorebooting.

I'm happy with the current set up and think that we have enough confidence to roll this out to other branches.
qm-mini-vista01, qm-mini-vista02 both failed to come back this morning.  Bug 469332
qm-plinux-fast01 down again (bug 469404).
Attachment #353517 - Flags: review?(anodelman)
Updated sudoers on:
qm-mini-ubuntu01   
qm-mini-ubuntu02   
qm-mini-ubuntu03   
qm-mini-ubuntu05   
qm-pleopard-stage01
qm-pleopard-talos01
qm-pleopard-talos02
qm-pleopard-talos04
qm-pleopard-trunk01
qm-pleopard-trunk02
qm-pleopard-trunk03
qm-pleopard-trunk04
qm-pleopard-trunk06
qm-pleopard-trunk07
qm-pleopard-trunk08
qm-plinux-fast01   
qm-plinux-fast03   
qm-plinux-fast04   
qm-plinux-talos01  
qm-plinux-talos02  
qm-plinux-talos03  
qm-plinux-talos04  
qm-plinux-trunk01  
qm-plinux-trunk03  
qm-plinux-trunk04  
qm-plinux-trunk05  
qm-plinux-trunk06  
qm-plinux-trunk07  
qm-pmac-fast01     
qm-pmac-fast03     
qm-pmac-fast04     
qm-pmac-talos01    
qm-pmac-talos02    
qm-pmac-talos03    
qm-pmac-talos04    
qm-pmac-trunk01    
qm-pmac-trunk02    
qm-pmac-trunk03    
qm-pmac-trunk07    
qm-pmac-trunk08    
qm-pmac-trunk09    
qm-pmac-trunk10    
qm-pmac01          
qm-pmac02          
qm-pmac03          
qm-pmac05          
qm-ptiger-stage01  
qm-ptiger-try01    
qm-pubuntu-stage01 
qm-pubuntu-try01   

qm-pleopard-talos03 is down, so it couldn't be updated.
Attachment #353517 - Flags: review?(anodelman) → review+
Still missing here:

- updates to sudoers on qm-pleopard-talos03
- addition of force resolution program and update to start batch script on winxp talos boxes
- addition of crontab to restart apache on mac tiger talos boxes
- schedule downtime to push out the change and watch the results for consistency
Depends on: 470047
Added crontab to:
qm-pmac-fast01
qm-pmac-fast03
qm-pmac-fast04
qm-pmac-talos01
qm-pmac-talos02
qm-pmac-talos03
qm-pmac-talos04
qm-pmac-trunk01
qm-pmac-trunk02
qm-pmac-trunk03
qm-pmac-trunk07
qm-pmac-trunk08
qm-pmac-trunk09
qm-pmac-trunk10
qm-pmac01
qm-pmac02
qm-pmac03
qm-pmac05
qm-ptiger-stage01
qm-ptiger-try01
qm-pleopard-talos03 has updated sudoers file now
Updated winxp talos machines with resolution setting on startup.

qm-pxp-fast01
qm-pxp-fast03
qm-pxp-fast04
qm-mini-xp01
qm-mini-xp02
qm-mini-xp03
qm-mini-xp05
qm-pxp-talos01
qm-pxp-talos02
qm-pxp-talos03
qm-pxp-talos04
qm-pxp-trunk01
qm-pxp-trunk02
qm-pxp-trunk03
qm-pxp-trunk04
qm-pxp-trunk05
qm-pxp-trunk06
qm-pxp-trunk07
qm-pxp-try01
qm-pxp-stage01
also:

qm-pxp-jss01
qm-pxp-jss02
qm-pxp-jss03

This just leaves scheduling a downtime to push this out to production.
Comment on attachment 353517 [details] [diff] [review]
[Checked in]Enable rebooting on all branches

Checking in perfrunner.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perfmaster/perfrunner.py,v  <--  perfrunner.py
new revision: 1.29; previous revision: 1.28
done
Attachment #353517 - Attachment description: Enable rebooting on all branches → [Checked in]Enable rebooting on all branches
Attachment #353517 - Flags: checked‑in+
Pushed out change to production during downtime this afternoon.  Numbers look good, will continue to check them periodically over the next week to ensure that things stay stable.

Otherwise, all done here.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Release Engineering: Talos → Release Engineering
Product: mozilla.org → Release Engineering