Closed Bug 1249662 Opened 8 years ago Closed 8 years ago

Investigate Puppet error during 2008 golden image creation

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: markco, Assigned: grenade)

Details

(Whiteboard: [windows][aws])

Attachments

(6 files)

Fri Feb 19 03:09:23 -0800 2016 /Stage[main]/Tweaks::Windows_network_opt_netsh/Exec[network_tweak_bat] (err): Failed to call refresh: "C:\Windows\system32\schtasks.exe" /Run /TN SchTsk_netsh returned 1 instead of one of [0]
Fri Feb 19 03:09:23 -0800 2016 /Stage[main]/Tweaks::Windows_network_opt_netsh/Exec[network_tweak_bat] (err): "C:\Windows\system32\schtasks.exe" /Run /TN SchTsk_netsh returned 1 instead of one of [0]
Attached patch Bug1249662.patchSplinter Review
The space in the variable value was causing an error.
Attachment #8721360 - Flags: review?(arich)
Attachment #8721360 - Flags: review?(arich) → review+
Assignee: relops → mcornmesser
To resolve the following re-occurring error on the first Puppet run of 2008 gold AMIs:

Thu Feb 25 02:52:56 -0800 2016 /Stage[main]/Tweaks::Windows_network_opt_netsh/Exec[network_tweak_bat] (err): Failed to call refresh: "C:\Windows\system32\schtasks.exe" /Run /TN SchTsk_netsh returned 1 instead of one of [0]
Thu Feb 25 02:52:56 -0800 2016 /Stage[main]/Tweaks::Windows_network_opt_netsh/Exec[network_tweak_bat] (err): "C:\Windows\system32\schtasks.exe" /Run /TN SchTsk_netsh returned 1 instead of one of [0]

What I think is happening is that when the concated network tweak script is modified a file lock is placed on the file. The lock is still in place the first time the script is called (when the schedule task SchTsk_netsh is called). With the retries it will execute after the lock is removed.
Attachment #8723611 - Flags: review?(rthijssen)
Comment on attachment 8723611 [details] [diff] [review]
bug1249662-2.patch

Review of attachment 8723611 [details] [diff] [review]:
-----------------------------------------------------------------

ship it!
Attachment #8723611 - Flags: review?(rthijssen) → review+
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
grenade: Could you remove the SchTsk_netsh schedule task through Userdata?

What was happening was the when the concatted file change it it wasn't able to execute the scheduled task. I have added a clean up to remove the schedule task, so it can be created and ran when the file changes. For the existing AMIs though we will need to remove the task.
Attachment #8725758 - Flags: review?(rthijssen)
Attachment #8725758 - Flags: review?(rthijssen) → review+
Attachment #8726136 - Flags: review?(mcornmesser) → review+
This still isn't executing the new the new network_tweak.bat:

Tue Mar 08 03:14:25 -0800 2016 /Stage[main]/Tweaks::Windows_network_opt_netsh/Exec[netsh_error_check] (err): Failed to call refresh: C:/etc/netsh_check.bat returned 0 instead of one of [1]
Tue Mar 08 03:14:25 -0800 2016 /Stage[main]/Tweaks::Windows_network_opt_netsh/Exec[netsh_error_check] (err): C:/etc/netsh_check.bat returned 0 instead of one of [1]
I am unable to recreate the error from comment 9 in testing. Which means there may be something with in the environment of the userdata that is causing the error. 

grenade: Any thoughts? Alternatively we can just remove the log file and revisit this in the future if we need to make further changes using netsh commands.
Flags: needinfo?(rthijssen)
I reproduced on a spot instance:

C:\Users\Administrator>c:\etc\netsh_check.bat

C:\Users\Administrator>REM  This Source Code Form is subject to the terms of the
 Mozilla Public

C:\Users\Administrator>REM License, v. 2.0. If a copy of the MPL was not distrib
uted with this

C:\Users\Administrator>REM file, You can obtain one at http://mozilla.org/MPL/2.
0/.

C:\Users\Administrator>REM Remove schedule task, so that it could be recreated i
f the network_tweak.bat changes

C:\Users\Administrator>REM Ref: https://bugzilla.mozilla.org/show_bug.cgi?id=124
9662

C:\Users\Administrator>schtasks.exe /Delete  /TN SchTsk_netsh /F
ERROR: The system cannot find the file specified.

C:\Users\Administrator>REM Checking for log file that is generated on error of n
etwork_tweak.bat

C:\Users\Administrator>dir C:\ProgramData\PuppetLabs\puppet\var\log\netsh_error.
log /Q
 Volume in drive C has no label.
 Volume Serial Number is 0EA8-1712

 Directory of C:\ProgramData\PuppetLabs\puppet\var\log

11/04/2015  12:15 PM               135 BUILTIN\Administrators netsh_error.log
               1 File(s)            135 bytes
               0 Dir(s)  59,265,245,184 bytes free
C:\Users\Administrator>echo %errorlevel%
0
Flags: needinfo?(rthijssen)
The puppet call executes the batch file at c:\etc\netsh_check.bat and then checks for an error code of 1 (failure) but running the bat file returns an error code of 0 (success). Puppet just reports what happened. We should either update Puppets expectations (to expect 0), or edit the bat file in a way that would make it return an error code of 1.

Either way, the puppet result to complain in the way it currently does, is perfectly logical.

The second command in the bat file is to query the existence of a specific log file, which it finds and lists successfully. The error code for this command is 0 and that is what the bat file returns. If what is actually wanted is the error code from the first command (which tries to delete a scheduled task that doesn't exist), then the batch file should be updated to capture and return %errorlevel% from that call.
This is actually way complicated because the Puppet agent was able to call the netsh commands directly, so there are multiple parts. 

The first part is Puppet concats (based on OS type and location) the network_tweak.bat:

REM remove old log file/semaphore
del C:\ProgramData\PuppetLabs\puppet\var\log\netsh_error.log /Q
set netcmd=C:\windows\System32\netsh.exe int tcp set global netdma=enabled
 %netcmd%
 If %errorlevel% neq 0 echo %netcmd% failed with  exit code %errorlevel% >>C:\ProgramData\PuppetLabs\puppet\var\log\netsh_error.log
set netcmd=C:\windows\System32\netsh.exe int tcp set global congestionprovider=ctcp

The script runs the command then redirects/creates the log on error. 

Because of the issue of the Puppet agent running the netsh commands we then create a schedule task and run a schedule task to execute the script. Which then creates the issue that if a command fails it is not communicated back to the Puppet agent. To work around that is the reason why I have the log file created on error. 

The final part of this is the netsh_check.bat. It was a conveniently place to remove the schedule task, but more importantly it checks for the existence of the log file that is created on error. If everything worked the last exit code of the check script will be 1. An exitcode of 0 would mean the log file exists, and that there was one or more failures in the network_tweak.bat.

Where we are now. I can take the base ami run Puppet manually and have no errors. During the golden ami creation we get the error from comment 9. Looking at the spot instance it has the latest version of the network_tweak.bat, but the log file at C:\ProgramData\PuppetLabs\puppet\var\log\netsh_error.log is from 2015-11. Which leads me to believe that the schedule task is not being executed during the Puppet run from Userdata, or the deletion of the of the log file in the network_tweak.bat is failing. 

grenade: sorry for the over complication of this. There was an incredible amount failures that lead to it being setup like this. Any other thoughts on what could be happening? Alternatively we can delete the old log file pre-Puppet run, and revisit this is if we ever change the contents of the network_tweak.bat.
Flags: needinfo?(rthijssen)
ok, if i understood correctly there is a log file that exists before the puppet run which shouldn't. this pr to cloud-tools removes the file (if it exists) before starting the puppet run.
Flags: needinfo?(rthijssen)
Attachment #8729094 - Flags: review?(mcornmesser)
Comment on attachment 8729094 [details] [review]
https://github.com/mozilla/build-cloud-tools/pull/194

Looks good. I am concerned about the Travis check fail though:

ERROR: invocation failed, logfile: /home/travis/build/mozilla/build-cloud-tools/.tox/py27/log/py27-2.log

ERROR: actionid=py27

R+ as long as that is none impacting.
Attachment #8729094 - Flags: review?(mcornmesser) → review+
Flags: needinfo?(rthijssen)
Whiteboard: [windows][aws]
It turns out that b-2008 golden instances contain a truncated hostname in their COMPUTERNAME environment variable. This confuses the puppet cert issue process as well as the checks for a successful cert download.

I believe this pr addresses both issues.
Flags: needinfo?(rthijssen)
Attachment #8749167 - Flags: review?(mcornmesser)
Attachment #8749167 - Flags: review?(mcornmesser) → review+
the currently running userdata has failed to obtain a cert from ca with the error message:

C:\ProgramData\PuppetLabs\puppet\var\puppettize_TEMP.vbs(88, 1) Microsoft VBScript runtime error: Input past end of file

new userdata from pr 213 has correctly detected the cert download failure (it didn't used to, before it would only detect puppet run failures):

2016-05-05 08:16:29 -07:00 [DEBUG] Install-Certificates :: Function started
2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: applied deploy pass
2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: removed C:\ProgramData\PuppetLabs\puppet\var\ssl\certs\ca.pem
2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: removed C:\ProgramData\PuppetLabs\puppet\var\ssl\certs\b-2008-ec2-golden.build.releng.use1.mozilla.com.pem
2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: removed C:\ProgramData\PuppetLabs\puppet\var\ssl\private_keys\b-2008-ec2-golden.build.releng.use1.mozilla.com.pem
2016-05-05 08:16:36 -07:00 [INFO] Install-Certificates :: installing certificates
2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: removed C:\ProgramData\PuppetLabs\puppet\var\ssl\private_keys
2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: created C:\ProgramData\PuppetLabs\puppet\var\ssl\private_keys
2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: removed C:\ProgramData\PuppetLabs\puppet\var\ssl\certs
2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: created C:\ProgramData\PuppetLabs\puppet\var\ssl\certs
2016-05-05 08:16:54 -07:00 [INFO] Install-Certificates :: puppettize vbs run completed
2016-05-05 08:16:54 -07:00 [Error] Install-Certificates :: missing cert detected after puppetize vbs run (C:\ProgramData\PuppetLabs\puppet\var\ssl\certs\ca.pem)
2016-05-05 08:16:54 -07:00 [Error] Install-Certificates :: missing cert detected after puppetize vbs run (C:\ProgramData\PuppetLabs\puppet\var\ssl\certs\b-2008-ec2-golden.build.releng.use1.mozilla.com.pem)
2016-05-05 08:16:54 -07:00 [Error] Install-Certificates :: missing cert detected after puppetize vbs run (C:\ProgramData\PuppetLabs\puppet\var\ssl\private_keys\b-2008-ec2-golden.build.releng.use1.mozilla.com.pem)
2016-05-05 08:16:54 -07:00 [DEBUG] Install-Certificates :: Function ended
2016-05-05 08:16:54 -07:00 [ERROR] Run-Puppet :: detected puppet certificate installation failure
2016-05-05 08:16:54 -07:00 [DEBUG] Run-Puppet :: retry in 30 minutes...

so I don't undertand why the puppetize run is failing, but we'll see (by letting userdata loop on cert download) if subsequent attempts are any better. Note that before we didn't reattempt cert downloads automatically, we only reattempted puppet runs. so the situation is slightly improved but I'm none the wiser about why puppettize didn't successfully download the certs on the first attempt.
Assignee: mcornmesser → rthijssen
fixed here: https://github.com/mozilla/build-cloud-tools/commit/6b180d63bee43de1a1ab834a3739194563d0b9f2
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: