Wrapper script for TalosStart.bat.

RESOLVED FIXED

Status

Infrastructure & Operations
RelOps

People

(Reporter: markco, Assigned: markco)

(Assignee)

Description

4 years ago
To prevent Talos from starting before the Puppet run launched at startup has completed, PuppetRun.bat had to be changed and TalosWrap.bat was written.

PuppetRun.bat:

cd "C:\Program Files\Puppet Labs\Puppet\bin"

cmd /c puppet agent --test --environment=mcornmesser --server=releng-puppet2.srv.releng.scl3.mozilla.com>>C:\programdata\PuppetLabs\puppet\var\log\pupptlog.log 2>&1

echo %DATE%%TIME% END>"C:\programdata\PuppetLabs\puppet\var\log\PuppetComplete.txt"

At the end of the .bat, I added the creation of a file, PuppetComplete.txt. This file is removed on shutdown through a GPO shutdown script. The file is used as a flag to check whether the Puppet run has completed, as shown in TalosWrap.bat.

As a side note, I changed the logging location to a more appropriate directory.

TalosWrap.bat:

echo off
echo Waiting for puppet to complete 

 

:FlagFile1
echo Checking for completion of PuppetRun.bat Attempt 1... 15 seconds
IF exist C:\ProgramData\PuppetLabs\puppet\var\log\PuppetComplete.txt (GOTO RubyCheck)
sleep 15 
GOTO FlagFile2

:FlagFile2
echo Checking for completion of PuppetRun.bat Attempt 2... 45 seconds
IF exist C:\ProgramData\PuppetLabs\puppet\var\log\PuppetComplete.txt (GOTO RubyCheck)
sleep 45 
GOTO FlagFile3

:FlagFile3
echo Checking for completion of PuppetRun.bat Final Attempt... 60 seconds
IF exist C:\ProgramData\PuppetLabs\puppet\var\log\PuppetComplete.txt (GOTO RubyCheck)
sleep 60
IF exist C:\ProgramData\PuppetLabs\puppet\var\log\PuppetComplete.txt (GOTO RubyCheck)

rem Log the failure before jumping to FAIL (these lines were unreachable after the GOTO).
echo Puppet may not have completed starting startTalos.bat
echo %DATE%%TIME% PuppetRun.bat failed to complete>> C:\ProgramData\PuppetLabs\puppet\var\log\fail.log 2>&1
GOTO FAIL

:RubyCheck
echo Last check that Puppet is no longer running.
tasklist /fi "imagename eq ruby.exe" |find "ruby.exe" 
if errorlevel 1 GOTO TALOS
echo %DATE%%TIME% Ruby.exe failed to terminate>>C:\ProgramData\PuppetLabs\puppet\var\log\fail.log 2>&1



:FAIL
IF exist C:\ProgramData\PuppetLabs\puppet\var\log\rebooted.log (GOTO END)
echo A single reboot will be performed to see if Puppet will complete.
echo %DATE%%TIME% restart attempt>C:\ProgramData\PuppetLabs\puppet\var\log\rebooted.log
shutdown /r /t 0 /f
rem Exit so the script does not fall through to :Talos while the reboot is pending.
exit


:Talos
echo Kicking off TalosStart.bat 
IF exist C:\ProgramData\PuppetLabs\puppet\var\log\rebooted.log (del /f C:\ProgramData\PuppetLabs\puppet\var\log\rebooted.log)
start /wait cmd /c C:\slave\startTalos.bat
GOTO Finish

:END
Echo After 1 reboot attempt Puppet has seemed to fail. Start Talos has been suspended!!!!
pause
exit

:Finish 
Echo It's all good.
sleep 45
exit

This script first checks for the flag file for a total of 2 minutes after logon (15 + 45 + 60 seconds). Keep in mind that the Puppet run starts through a scheduled task at startup, so this allows a reasonable window for Puppet to complete, and the script moves on to the next step as soon as it finds the file on one of the earlier checks.

If it finds the file, it then checks whether the Ruby process is running. This is more or less a safety step to catch Puppet still making changes or being hung for any reason. If both the file and the process conditions are met, it goes on to start TalosStart.bat.

If it fails all the file checks, or if Ruby is still running, it fails and writes a log entry with the time to the logging directory. It also writes a flag file to the logging directory. If that flag file exists, the script will not reboot the machine a second time when it fails to meet any of the conditions. If Puppet doesn't complete after the reboot, the script pauses and states, "After 1 reboot attempt Puppet has seemed to fail. Start Talos has been suspended!!!!"

A TODO for this script is to have it report out if it ultimately fails, but for now it meets the needs.
(Assignee)

Updated

4 years ago
Assignee: relops → mcornmesser
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Depends on: 891982
Resolution: --- → FIXED
(Assignee)

Comment 1

4 years ago
This is also set up to replace the start talos scheduled task in the Schedule_tasks_testers GPO. It is set up to trigger on cltbld login. Currently it is restricted through item-level targeting so that it only applies to machines in the Win 7 pup_test OU.
Created attachment 8346548
OS X Startup Scripts

This is substantially parallel to the structure we use on POSIX, but differs in a few important ways.  In particular, it's similar to the launchd scripts for OS X, which like Scheduled Tasks doesn't have a way to ensure that one task ends before the next begins.  Attached are the OS X launchd scripts and shell script.  As you can see, launchd can watch a semaphore (/var/tmp/semaphore/run-buildbot), rather than having to write that watching in a script, but overall the effect is the same: run puppet until success, create a semaphore, run buildbot.

The most substantial difference is that on POSIX, the puppet script retries automatically (run_until_success), only rebooting after about 2h of trying.  This automatic reboot used to catch the case where old versions of Fedora wouldn't get an IP on startup, but it occurs very rarely these days.

The mission-critical bit here is this: buildbot cannot start until there is a *successful* run of puppet.

The other difference is around the semaphore and buildbot startup.  The Buildbot script should wait forever for that semaphore -- if there's a puppet problem that causes the successful puppet run to take two hours, we don't want startTalos to have given up, as that will mean we're manually touching hundreds of machines.  Never give up.  Never surrender!

Finally, the handling of the semaphore file can be safer.  An on-shutdown task won't run if there's a power-failure or abnormal system termination, meaning that on the next startup, startTalos will think that the puppet run is complete and not wait for a new run -- very bad, as it will start Talos early and then run talos tests while puppet is also running, resulting in incorrect performance data.

The better model would be to remove that file *both* at the beginning of PuppetRun.bat and just after it's detected in startTalos.bat.  It'd be even better to have something run earlier in startup (perhaps a startup script, since AIUI these scheduled tasks run after login?) that removes the semaphore.
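As a rough illustration of the early-startup cleanup suggested above, the startup script could be as small as the sketch below. The script name ClearPuppetFlag.bat and the FLAG variable are hypothetical; only the flag path comes from this bug, and the script would still need to be wired in as a startup script by whatever mechanism is chosen.

```bat
@echo off
setlocal
rem ClearPuppetFlag.bat (hypothetical name). The default path is the flag
rem file used in this bug; FLAG can be overridden for testing.
if not defined FLAG set "FLAG=C:\ProgramData\PuppetLabs\puppet\var\log\PuppetComplete.txt"

rem Remove a flag left over from a previous boot, if there is one.
if exist "%FLAG%" del /f /q "%FLAG%"

rem Report failure so a stale flag is noticed instead of silently
rem letting Talos start early.
if exist "%FLAG%" (
    echo %DATE% %TIME% could not remove "%FLAG%"
    exit /b 1
)
exit /b 0
```

Because this does not depend on a shutdown task having run, a power failure or abnormal termination no longer leaves a flag that the next boot would trust.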

Finally, while we need to get these files and scheduled tasks installed during system installation, we should pretty quickly transition to managing them with puppet itself.  That will definitely come later, after we land the initial puppet-on-windows patch.
(Assignee)

Updated

4 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 3

4 years ago
Made some adjustments to the script. It continues to check for the semaphore, and that ruby.exe is no longer running, for a duration of 2 hours, and there is no longer a fail-out. It will continuously check and reboot in an effort to start Talos. Also, the semaphore is now removed in the same step that starts starttalos.bat.


echo off
echo Waiting for puppet to complete 

rem Get start time:
for /F "tokens=1-4 delims=:.," %%a in ("%time%") do (
   set /A "start=(((%%a*60)+1%%b %% 100)*60+1%%c %% 100)*100+1%%d %% 100"
)
 

:FileCheck
echo Checking for completion of PuppetRun.bat.
IF exist C:\ProgramData\PuppetLabs\puppet\var\log\PuppetComplete.txt (GOTO RubyCheck)
echo PuppetRun.bat has not completed. Will check again in 15 seconds.
sleep 15

rem Get end time:
for /F "tokens=1-4 delims=:.," %%a in ("%time%") do (
   set /A "end=(((%%a*60)+1%%b %% 100)*60+1%%c %% 100)*100+1%%d %% 100"
)

rem Get elapsed time:
set /A elapsed=((end-start)/10)/10

rem Keep polling for 2 hours (7200 seconds), then give up and reboot.
IF %elapsed% lss 7200 GOTO FileCheck
GOTO FAIL

:RubyCheck
echo Checking that Ruby.exe is no longer running.
tasklist /fi "imagename eq ruby.exe" |find "ruby.exe" 
if errorlevel 1 GOTO TALOS
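echo Ruby.exe has not completed. Will check again in 15 seconds.
sleep 15

rem Get end time (it must be refreshed here, otherwise elapsed never changes in this loop):
for /F "tokens=1-4 delims=:.," %%a in ("%time%") do (
   set /A "end=(((%%a*60)+1%%b %% 100)*60+1%%c %% 100)*100+1%%d %% 100"
)

rem Get elapsed time:
set /A elapsed=((end-start)/10)/10

IF %elapsed% lss 7200 GOTO RubyCheck

rem Two hours have passed; log the failure and fall through to FAIL.
echo %DATE%%TIME% Ruby.exe failed to terminate>>C:\ProgramData\PuppetLabs\puppet\var\log\fail.log 2>&1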

:FAIL
echo A single reboot will be performed to see if Puppet will complete.
echo %DATE%%TIME% Puppet has seemed to fail. Rebooting the machine in an effort to recover.>>C:\programdata\PuppetLabs\puppet\var\log\pupptlog.log
shutdown /r /t 0 /f
rem Exit so the script does not fall through to :Talos while the reboot is pending.
exit


:Talos
echo Kicking off TalosStart.bat 
IF exist C:\ProgramData\PuppetLabs\puppet\var\log\PuppetComplete.txt (del /f C:\ProgramData\PuppetLabs\puppet\var\log\PuppetComplete.txt)
start /wait cmd /c C:\slave\startTalos.bat
Echo It's all good.
sleep 10
exit
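
The elapsed-time arithmetic in the script above is dense, so here is a small standalone sketch of the same technique using fixed timestamps. The :ToCentis helper name and the sample times are made up for illustration. The `1%%b %% 100` form prepends a 1 and takes the remainder so that fields such as "08" and "09" are not rejected as invalid octal by `set /A`. The sketch also adds one thing the script does not do: it corrects a negative result when the clock passes midnight between the two readings.

```bat
@echo off
setlocal
rem Sample timestamps in the same layout as %time% (hypothetical values).
call :ToCentis "23:59:50.00" start
call :ToCentis "0:00:05.00" end

rem Centiseconds to whole seconds.
set /A "elapsed=(end-start)/100"
rem If midnight passed between the two readings, add one day of seconds.
if %elapsed% lss 0 set /A "elapsed+=24*60*60"
echo %elapsed%
exit /b 0

:ToCentis
rem %1 = time string, %2 = name of the variable that receives centiseconds.
for /F "tokens=1-4 delims=:.," %%a in ("%~1") do (
   set /A "%~2=(((%%a*60)+1%%b %% 100)*60+1%%c %% 100)*100+1%%d %% 100"
)
exit /b 0
```

With these sample values the script prints 15, the number of seconds between 23:59:50 and 00:00:05.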
This doesn't seem to address:

(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #2)
> The other difference is around the semaphore and buildbot startup.  The
> Buildbot script should wait forever for that semaphore -- if there's a
> puppet problem that causes the successful puppet run to take two hours, we
> don't want startTalos to have given up, as that will mean we're manually
> touching hundreds of machines.  Never give up.  Never surrender!

Because the buildbot script is still timing out and rebooting eventually.  It really shouldn't do that - it could end up rebooting in the middle of a puppet run.
(Assignee)

Comment 5

4 years ago
Should the reboot be removed entirely? I left the reboot in for consistency across platforms, and because it is possible that Puppet will run correctly after a reboot.

Or should I place a check before the reboot to see if ruby.exe is running, and have the machine reboot only after 2 hours have passed, there is no semaphore, and ruby.exe is not running?
I'm trying to be consistent across platforms, too :)

The buildbot startup code *never* reboots the system on any of the POSIX platforms.  All of the logic about puppet is in the puppet script, which is quite lengthy as a result (and attached to comment 3).  The code to run the buildslave is pretty short - https://github.com/mozilla/build-puppet/blob/master/modules/buildslave/files/darwin-run-buildslave.sh

The only differences on Windows would be:
 * batch, not shell (ok, that's a big difference!)
 * the run-buildslave script has to implement the wait-forever-for-puppet bit, which launchd takes care of on OS X, manually.  But it really should be an infinite loop.
(Assignee)

Comment 7

4 years ago
Oh OK I was tripping on:
> The most substantial difference is that on POSIX, the puppet script retries
> automatically (run_until_success), only rebooting after about 2h of trying. 
> This automatic reboot used to catch the case where old versions of Fedora
> wouldn't get an IP on startup, but it occurs very rarely these days.

Something like that should go into PuppetRun.bat rather than TalosWrap.bat.
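
A minimal sketch of what a retry-until-success PuppetRun.bat might look like, assuming that exit codes 0 and 2 from `puppet agent --test` count as success (`--test` implies `--detailed-exitcodes`, where 0 means no changes and 2 means changes were applied). The FLAG and PUPPET_CMD variables are hypothetical hooks, the 60-second delay is arbitrary, and the log redirection, the --environment/--server options, and the reboot after about 2 hours of trying are left out for brevity.

```bat
@echo off
setlocal
rem Flag path taken from this bug; FLAG and PUPPET_CMD are hypothetical
rem override hooks. Add the --environment/--server options as needed.
if not defined FLAG set "FLAG=C:\ProgramData\PuppetLabs\puppet\var\log\PuppetComplete.txt"
if not defined PUPPET_CMD set "PUPPET_CMD=cmd /c puppet agent --test"

rem Drop any stale flag, for example one left behind by a power failure.
if exist "%FLAG%" del /f /q "%FLAG%"

cd /d "C:\Program Files\Puppet Labs\Puppet\bin"

:Retry
%PUPPET_CMD%
set "RC=%ERRORLEVEL%"
rem 0 = no changes, 2 = changes applied; anything else is treated as failure.
if "%RC%"=="0" goto Success
if "%RC%"=="2" goto Success
echo Puppet exited with code %RC%. Retrying in 60 seconds.
sleep 60
goto Retry

:Success
rem Only a successful run creates the flag that TalosWrap.bat waits for.
echo %DATE% %TIME% END>"%FLAG%"
exit /b 0
```

Writing the flag only on the success path keeps the rule that Talos cannot start until there has been a successful Puppet run.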

The TalosWrap.bat should be more like this:
echo off
echo Waiting for puppet to complete 

:FileCheck
echo Checking for completion of PuppetRun.bat.
IF exist C:\ProgramData\PuppetLabs\puppet\var\log\PuppetComplete.txt (GOTO RubyCheck)
echo PuppetRun.bat has not completed. Will check again in 15 seconds.
sleep 15
GOTO FileCheck

:RubyCheck
echo Checking that Ruby.exe is no longer running.
tasklist /fi "imagename eq ruby.exe" |find "ruby.exe" 
if errorlevel 1 GOTO Talos
echo Ruby.exe has not completed. Will check again in 15 seconds.
sleep 15
GOTO RubyCheck


:Talos
echo Kicking off TalosStart.bat 
IF exist C:\ProgramData\PuppetLabs\puppet\var\log\PuppetComplete.txt (del /f C:\ProgramData\PuppetLabs\puppet\var\log\PuppetComplete.txt)
start /wait cmd /c C:\slave\startTalos.bat
Echo It's all good.
sleep 10
exit
That looks great
(Assignee)

Updated

4 years ago
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED