Closed
Bug 1249662
Opened 8 years ago
Closed 8 years ago
Investigate Puppet error during 2008 golden image creation
Categories
(Infrastructure & Operations :: RelOps: Puppet, task)
Infrastructure & Operations
RelOps: Puppet
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: markco, Assigned: grenade)
Details
(Whiteboard: [windows][aws])
Attachments
(6 files)
1.02 KB,
patch
|
arich
:
review+
markco
:
checked-in+
|
Details | Diff | Splinter Review |
1019 bytes,
patch
|
grenade
:
review+
markco
:
checked-in+
|
Details | Diff | Splinter Review |
2.09 KB,
patch
|
grenade
:
review+
markco
:
checked-in+
|
Details | Diff | Splinter Review |
53 bytes,
text/x-github-pull-request
|
markco
:
review+
|
Details | Review |
53 bytes,
text/x-github-pull-request
|
markco
:
review+
|
Details | Review |
55 bytes,
text/x-github-pull-request
|
markco
:
review+
|
Details | Review |
Fri Feb 19 03:09:23 -0800 2016 /Stage[main]/Tweaks::Windows_network_opt_netsh/Exec[network_tweak_bat] (err): Failed to call refresh: "C:\Windows\system32\schtasks.exe" /Run /TN SchTsk_netsh returned 1 instead of one of [0] Fri Feb 19 03:09:23 -0800 2016 /Stage[main]/Tweaks::Windows_network_opt_netsh/Exec[network_tweak_bat] (err): "C:\Windows\system32\schtasks.exe" /Run /TN SchTsk_netsh returned 1 instead of one of [0]
Reporter | ||
Comment 1•8 years ago
|
||
The space in the variable value was causing an error.
Attachment #8721360 -
Flags: review?(arich)
Updated•8 years ago
|
Attachment #8721360 -
Flags: review?(arich) → review+
Reporter | ||
Updated•8 years ago
|
Assignee: relops → mcornmesser
Reporter | ||
Comment 2•8 years ago
|
||
Comment on attachment 8721360 [details] [diff] [review] Bug1249662.patch https://hg.mozilla.org/build/puppet/rev/c0875d917ea0
Attachment #8721360 -
Flags: checked-in+
Reporter | ||
Comment 3•8 years ago
|
||
To resolve the following re-occurring error on the first Puppet run of 2008 gold AMIs: Thu Feb 25 02:52:56 -0800 2016 /Stage[main]/Tweaks::Windows_network_opt_netsh/Exec[network_tweak_bat] (err): Failed to call refresh: "C:\Windows\system32\schtasks.exe" /Run /TN SchTsk_netsh returned 1 instead of one of [0] Thu Feb 25 02:52:56 -0800 2016 /Stage[main]/Tweaks::Windows_network_opt_netsh/Exec[network_tweak_bat] (err): "C:\Windows\system32\schtasks.exe" /Run /TN SchTsk_netsh returned 1 instead of one of [0] What I think is happening is that when the concated network tweak script is modified a file lock is placed on the file. The lock is still in place the first time the script is called (when the schedule task SchTsk_netsh is called). With the retries it will execute after the lock is removed.
Attachment #8723611 -
Flags: review?(rthijssen)
Assignee | ||
Comment 4•8 years ago
|
||
Comment on attachment 8723611 [details] [diff] [review] bug1249662-2.patch Review of attachment 8723611 [details] [diff] [review]: ----------------------------------------------------------------- ship it!
Attachment #8723611 -
Flags: review?(rthijssen) → review+
Reporter | ||
Comment 5•8 years ago
|
||
Comment on attachment 8723611 [details] [diff] [review] bug1249662-2.patch https://hg.mozilla.org/build/puppet/rev/c511f8f1a63b
Attachment #8723611 -
Flags: checked-in+
Reporter | ||
Updated•8 years ago
|
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Reporter | ||
Updated•8 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter | ||
Comment 6•8 years ago
|
||
grenade: Could you remove the SchTsk_netsh schedule task through Userdata? What was happening was the when the concatted file change it it wasn't able to execute the scheduled task. I have added a clean up to remove the schedule task, so it can be created and ran when the file changes. For the existing AMIs though we will need to remove the task.
Attachment #8725758 -
Flags: review?(rthijssen)
Assignee | ||
Updated•8 years ago
|
Attachment #8725758 -
Flags: review?(rthijssen) → review+
Assignee | ||
Comment 7•8 years ago
|
||
Attachment #8726136 -
Flags: review?(mcornmesser)
Reporter | ||
Updated•8 years ago
|
Attachment #8726136 -
Flags: review?(mcornmesser) → review+
Reporter | ||
Comment 8•8 years ago
|
||
Comment on attachment 8725758 [details] [diff] [review] Bug1249662-3.patch https://hg.mozilla.org/build/puppet/rev/51f95aeff02e
Attachment #8725758 -
Flags: checked-in+
Reporter | ||
Comment 9•8 years ago
|
||
This still isn't executing the new the new network_tweak.bat: Tue Mar 08 03:14:25 -0800 2016 /Stage[main]/Tweaks::Windows_network_opt_netsh/Exec[netsh_error_check] (err): Failed to call refresh: C:/etc/netsh_check.bat returned 0 instead of one of [1] Tue Mar 08 03:14:25 -0800 2016 /Stage[main]/Tweaks::Windows_network_opt_netsh/Exec[netsh_error_check] (err): C:/etc/netsh_check.bat returned 0 instead of one of [1]
Reporter | ||
Comment 10•8 years ago
|
||
I am unable to recreate the error from comment 9 in testing. Which means there may be something with in the environment of the userdata that is causing the error. grenade: Any thoughts? Alternatively we can just remove the log file and revisit this in the future if we need to make further changes using netsh commands.
Flags: needinfo?(rthijssen)
Assignee | ||
Comment 11•8 years ago
|
||
I reproduced on a spot instance: C:\Users\Administrator>c:\etc\netsh_check.bat C:\Users\Administrator>REM This Source Code Form is subject to the terms of the Mozilla Public C:\Users\Administrator>REM License, v. 2.0. If a copy of the MPL was not distrib uted with this C:\Users\Administrator>REM file, You can obtain one at http://mozilla.org/MPL/2. 0/. C:\Users\Administrator>REM Remove schedule task, so that it could be recreated i f the network_tweak.bat changes C:\Users\Administrator>REM Ref: https://bugzilla.mozilla.org/show_bug.cgi?id=124 9662 C:\Users\Administrator>schtasks.exe /Delete /TN SchTsk_netsh /F ERROR: The system cannot find the file specified. C:\Users\Administrator>REM Checking for log file that is generated on error of n etwork_tweak.bat C:\Users\Administrator>dir C:\ProgramData\PuppetLabs\puppet\var\log\netsh_error. log /Q Volume in drive C has no label. Volume Serial Number is 0EA8-1712 Directory of C:\ProgramData\PuppetLabs\puppet\var\log 11/04/2015 12:15 PM 135 BUILTIN\Administrators netsh_error.log 1 File(s) 135 bytes 0 Dir(s) 59,265,245,184 bytes free C:\Users\Administrator>echo %errorlevel% 0
Flags: needinfo?(rthijssen)
Assignee | ||
Comment 12•8 years ago
|
||
The puppet call executes the batch file at c:\etc\netsh_check.bat and then checks for an error code of 1 (failure) but running the bat file returns an error code of 0 (success). Puppet just reports what happened. We should either update Puppets expectations (to expect 0), or edit the bat file in a way that would make it return an error code of 1. Either way, the puppet result to complain in the way it currently does, is perfectly logical. The second command in the bat file is to query the existence of a specific log file, which it finds and lists successfully. The error code for this command is 0 and that is what the bat file returns. If what is actually wanted is the error code from the first command (which tries to delete a scheduled task that doesn't exist), then the batch file should be updated to capture and return %errorlevel% from that call.
Reporter | ||
Comment 13•8 years ago
|
||
This is actually way complicated because the Puppet agent was able to call the netsh commands directly, so there are multiple parts. The first part is Puppet concats (based on OS type and location) the network_tweak.bat: REM remove old log file/semaphore del C:\ProgramData\PuppetLabs\puppet\var\log\netsh_error.log /Q set netcmd=C:\windows\System32\netsh.exe int tcp set global netdma=enabled %netcmd% If %errorlevel% neq 0 echo %netcmd% failed with exit code %errorlevel% >>C:\ProgramData\PuppetLabs\puppet\var\log\netsh_error.log set netcmd=C:\windows\System32\netsh.exe int tcp set global congestionprovider=ctcp The script runs the command then redirects/creates the log on error. Because of the issue of the Puppet agent running the netsh commands we then create a schedule task and run a schedule task to execute the script. Which then creates the issue that if a command fails it is not communicated back to the Puppet agent. To work around that is the reason why I have the log file created on error. The final part of this is the netsh_check.bat. It was a conveniently place to remove the schedule task, but more importantly it checks for the existence of the log file that is created on error. If everything worked the last exit code of the check script will be 1. An exitcode of 0 would mean the log file exists, and that there was one or more failures in the network_tweak.bat. Where we are now. I can take the base ami run Puppet manually and have no errors. During the golden ami creation we get the error from comment 9. Looking at the spot instance it has the latest version of the network_tweak.bat, but the log file at C:\ProgramData\PuppetLabs\puppet\var\log\netsh_error.log is from 2015-11. Which leads me to believe that the schedule task is not being executed during the Puppet run from Userdata, or the deletion of the of the log file in the network_tweak.bat is failing. grenade: sorry for the over complication of this. There was an incredible amount failures that lead to it being setup like this. Any other thoughts on what could be happening? Alternatively we can delete the old log file pre-Puppet run, and revisit this is if we ever change the contents of the network_tweak.bat.
Flags: needinfo?(rthijssen)
Assignee | ||
Comment 14•8 years ago
|
||
ok, if i understood correctly there is a log file that exists before the puppet run which shouldn't. this pr to cloud-tools removes the file (if it exists) before starting the puppet run.
Flags: needinfo?(rthijssen)
Attachment #8729094 -
Flags: review?(mcornmesser)
Reporter | ||
Comment 15•8 years ago
|
||
Comment on attachment 8729094 [details] [review] https://github.com/mozilla/build-cloud-tools/pull/194 Looks good. I am concerned about the Travis check fail though: ERROR: invocation failed, logfile: /home/travis/build/mozilla/build-cloud-tools/.tox/py27/log/py27-2.log ERROR: actionid=py27 R+ as long as that is none impacting.
Attachment #8729094 -
Flags: review?(mcornmesser) → review+
Updated•8 years ago
|
Flags: needinfo?(rthijssen)
Whiteboard: [windows][aws]
Assignee | ||
Comment 16•8 years ago
|
||
It turns out that b-2008 golden instances contain a truncated hostname in their COMPUTERNAME environment variable. This confuses the puppet cert issue process as well as the checks for a successful cert download. I believe this pr addresses both issues.
Flags: needinfo?(rthijssen)
Attachment #8749167 -
Flags: review?(mcornmesser)
Reporter | ||
Updated•8 years ago
|
Attachment #8749167 -
Flags: review?(mcornmesser) → review+
Assignee | ||
Comment 17•8 years ago
|
||
the currently running userdata has failed to obtain a cert from ca with the error message: C:\ProgramData\PuppetLabs\puppet\var\puppettize_TEMP.vbs(88, 1) Microsoft VBScript runtime error: Input past end of file new userdata from pr 213 has correctly detected the cert download failure (it didn't used to, before it would only detect puppet run failures): 2016-05-05 08:16:29 -07:00 [DEBUG] Install-Certificates :: Function started 2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: applied deploy pass 2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: removed C:\ProgramData\PuppetLabs\puppet\var\ssl\certs\ca.pem 2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: removed C:\ProgramData\PuppetLabs\puppet\var\ssl\certs\b-2008-ec2-golden.build.releng.use1.mozilla.com.pem 2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: removed C:\ProgramData\PuppetLabs\puppet\var\ssl\private_keys\b-2008-ec2-golden.build.releng.use1.mozilla.com.pem 2016-05-05 08:16:36 -07:00 [INFO] Install-Certificates :: installing certificates 2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: removed C:\ProgramData\PuppetLabs\puppet\var\ssl\private_keys 2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: created C:\ProgramData\PuppetLabs\puppet\var\ssl\private_keys 2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: removed C:\ProgramData\PuppetLabs\puppet\var\ssl\certs 2016-05-05 08:16:36 -07:00 [DEBUG] Install-Certificates :: created C:\ProgramData\PuppetLabs\puppet\var\ssl\certs 2016-05-05 08:16:54 -07:00 [INFO] Install-Certificates :: puppettize vbs run completed 2016-05-05 08:16:54 -07:00 [Error] Install-Certificates :: missing cert detected after puppetize vbs run (C:\ProgramData\PuppetLabs\puppet\var\ssl\certs\ca.pem) 2016-05-05 08:16:54 -07:00 [Error] Install-Certificates :: missing cert detected after puppetize vbs run (C:\ProgramData\PuppetLabs\puppet\var\ssl\certs\b-2008-ec2-golden.build.releng.use1.mozilla.com.pem) 2016-05-05 08:16:54 -07:00 [Error] Install-Certificates :: missing cert detected after puppetize vbs run (C:\ProgramData\PuppetLabs\puppet\var\ssl\private_keys\b-2008-ec2-golden.build.releng.use1.mozilla.com.pem) 2016-05-05 08:16:54 -07:00 [DEBUG] Install-Certificates :: Function ended 2016-05-05 08:16:54 -07:00 [ERROR] Run-Puppet :: detected puppet certificate installation failure 2016-05-05 08:16:54 -07:00 [DEBUG] Run-Puppet :: retry in 30 minutes... so I don't undertand why the puppetize run is failing, but we'll see (by letting userdata loop on cert download) if subsequent attempts are any better. Note that before we didn't reattempt cert downloads automatically, we only reattempted puppet runs. so the situation is slightly improved but I'm none the wiser about why puppettize didn't successfully download the certs on the first attempt.
Reporter | ||
Updated•8 years ago
|
Assignee: mcornmesser → rthijssen
Assignee | ||
Comment 18•8 years ago
|
||
fixed here: https://github.com/mozilla/build-cloud-tools/commit/6b180d63bee43de1a1ab834a3739194563d0b9f2
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•