Closed Bug 1375852 Opened 7 years ago Closed 7 years ago

Cannot create Windows golden AMIs

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86_64
Windows
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aselagea, Assigned: markco)

Attachments

(1 file)

From #buildduty:
Fri 13:41:19 UTC [7048] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 20 crit, 0 warn out of 20 processes with args 'ec2-golden' (http://m.mozilla.org/procs+age+-+golden+AMI)

Looking at aws-manager, I noticed that all 20 processes were for Windows AMIs. The Papertrail logs seem to show many unexpected shutdowns.

Jun 23 02:37:42 b-2008-ec2-golden.build.releng.use1.mozilla.com EventLog: The previous system shutdown at 3:40:24 AM on ‎6/‎21/‎2017 was unexpected.#015
Jun 23 02:37:43 b-2008-ec2-golden.build.releng.use1.mozilla.com Microsoft-Windows-Kernel-Power: The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.#015
Jun 23 02:37:43 b-2008-ec2-golden.build.releng.use1.mozilla.com Microsoft-Windows-Eventlog: Audit events have been dropped by the transport.  0#015
Jun 23 02:37:44 b-2008-ec2-golden.build.releng.use1.mozilla.com Xennet6: 2017-06-23 02:36:02 b-2008-ec2-golden.build.releng.use1.mozilla.com ERROR 5001 [The description for EventID 5001 from source Xennet6 cannot be found: The parameter is incorrect.  ]#015
Jun 23 02:37:44 b-2008-ec2-golden.build.releng.use1.mozilla.com EventLog: The previous system shutdown at 3:40:24 AM on ‎6/‎21/‎2017 was unexpected.#015
Jun 23 02:37:44 b-2008-ec2-golden.build.releng.use1.mozilla.com Microsoft-Windows-Kernel-Power: The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.#015
Jun 23 02:37:45 b-2008-ec2-golden.build.releng.use1.mozilla.com Microsoft-Windows-Eventlog: Audit events have been dropped by the transport.  0#015
Jun 23 02:38:06 b-2008-ec2-golden.build.releng.use1.mozilla.com Xennet6: 2017-06-23 02:36:01 b-2008-ec2-golden.build.releng.use1.mozilla.com ERROR 5001 [The description for EventID 5001 from source Xennet6 cannot be found: The parameter is incorrect.  ]#015
Jun 23 02:38:06 b-2008-ec2-golden.build.releng.use1.mozilla.com EventLog: The previous system shutdown at 3:40:24 AM on ‎6/‎21/‎2017 was unexpected.#015
Jun 23 02:38:06 b-2008-ec2-golden.build.releng.use1.mozilla.com Microsoft-Windows-Kernel-Power: The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.#015
Jun 23 02:38:06 b-2008-ec2-golden.build.releng.use1.mozilla.com Microsoft-Windows-Eventlog: Audit events have been dropped by the transport.  0#015
Jun 23 02:38:47 b-2008-ec2-golden.build.releng.use1.mozilla.com USER32: The process C:\Windows\system32\shutdown.exe (B-2008-EC2-GOLD) has initiated the restart of computer B-2008-EC2-GOLD on behalf of user NT AUTHORITY\SYSTEM for the following reason: Application: Maintenance (Planned)   Reason Code: 0x80040001   Shutdown Type: restart   Comment: host renamed#015
Jun 23 02:38:48 b-2008-ec2-golden.build.releng.use1.mozilla.com USER32: The process C:\Windows\system32\shutdown.exe (B-2008-EC2-GOLD) has initiated the restart of computer B-2008-EC2-GOLD on behalf of user NT AUTHORITY\SYSTEM for the following reason: Application: Maintenance (Planned)   Reason Code: 0x80040001   Shutdown Type: restart   Comment: host renamed#015
@Dragos: do you think this might be related to your recent puppet changes?
Flags: needinfo?(dcrisan)
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #1)
> @Dragos: do you think this might be related to your recent puppet changes?

From the logs, these unexpected shutdowns do not appear to be related to the Puppet changes.
Flags: needinfo?(dcrisan)
I am going to terminate y-2008-ec2-gold, restart the process, and watch the logs.
Assignee: nobody → mcornmesser
Component: Buildduty → RelOps
Product: Release Engineering → Infrastructure & Operations
QA Contact: catlee → arich
Version: unspecified → other
These messages do not reflect a shutdown near the time they were logged. The message below was reported about an hour after the last reboot.

Jun 23 10:42:34 y-2008-ec2-golden.try.releng.use1.mozilla.com EventLog: The previous system shutdown at 3:44:00 AM on ‎6/‎21/‎2017 was unexpected.#015
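One way to check how stale such a message is would be to compare the syslog date with the shutdown date quoted in the message body. The following is only a sketch: `shutdown_age_in_days` and `SHUTDOWN_RE` are hypothetical names, and it assumes the syslog entry was written in the same year that appears in the message text, since the syslog prefix carries no year.

```ruby
require 'date'

# Hypothetical pattern for the "previous system shutdown ... was unexpected"
# EventLog lines shown above. Captures: syslog month, syslog day, then the
# month/day/year reported inside the message.
SHUTDOWN_RE = %r{\A(\w{3})\s+(\d{1,2}) [\d:]+ \S+ EventLog: The previous system shutdown at .*? on (\d{1,2})/(\d{1,2})/(\d{4}) was unexpected}

# Returns the number of days between the syslog date and the reported
# shutdown date, or nil if the line is not an unexpected-shutdown message.
def shutdown_age_in_days(line)
  # The Windows event text embeds U+200E (left-to-right mark) before each
  # date component; strip it so the digits can be matched.
  clean = line.delete("\u200E")
  m = SHUTDOWN_RE.match(clean)
  return nil unless m

  year = m[5].to_i # assumption: syslog year equals the year in the message
  logged = Date.strptime("#{m[1]} #{m[2]} #{year}", '%b %d %Y')
  reported = Date.new(year, m[3].to_i, m[4].to_i)
  (logged - reported).to_i
end
```

Applied to the line above, a result of two or more days would support the reading that the message describes an old shutdown rather than one that just happened.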
Currently the process is not getting to the Puppet run. The last Puppet run was from the previous day.

https://foreman.pub.build.mozilla.org/reports/26842396
I think it is getting to the Puppet run, but the run is not completing successfully and is not reporting back to Foreman. From the end of the log on the local machine:

2017-06-23 09:37:00 -0700 /Stage[main]/Packages::Mozilla::Git/File[C:/installersource/puppetagain.pub.build.mozilla.org/EXEs/Git-2.7.4-32-bit.exe] (notice): Dependency File[C:/installersource/puppetagain.pub.build.mozilla.org] has failures: true
2017-06-23 09:37:00 -0700 /Stage[main]/Packages::Mozilla::Git/File[C:/installersource/puppetagain.pub.build.mozilla.org/EXEs/Git-2.7.4-32-bit.exe] (warning): Skipping because of failed dependencies
2017-06-23 09:37:00 -0700 /Stage[main]/Packages::Mozilla::Git/Exec[Git-2.7.4-32-bit.exe] (notice): Dependency File[C:/installersource/puppetagain.pub.build.mozilla.org] has failures: true
2017-06-23 09:37:00 -0700 /Stage[main]/Packages::Mozilla::Git/Exec[Git-2.7.4-32-bit.exe] (warning): Skipping because of failed dependencies
2017-06-23 09:37:00 -0700 /Stage[main]/Packages::Ultravnc/Packages::Pkgmsi[UltraVnc]/File[c:/InstallerSource/puppetagain.pub.build.mozilla.org/MSIs/UltraVnc_10962_x64.msi] (notice): Dependency File[C:/installersource/puppetagain.pub.build.mozilla.org] has failures: true
2017-06-23 09:37:00 -0700 /Stage[main]/Packages::Ultravnc/Packages::Pkgmsi[UltraVnc]/File[c:/InstallerSource/puppetagain.pub.build.mozilla.org/MSIs/UltraVnc_10962_x64.msi] (warning): Skipping because of failed dependencies
2017-06-23 09:37:00 -0700 /Stage[main]/Packages::7z920/Packages::Pkgmsi[7-Zip 9.20]/File[c:/InstallerSource/puppetagain.pub.build.mozilla.org/MSIs/7z920.msi] (notice): Dependency File[C:/installersource/puppetagain.pub.build.mozilla.org] has failures: true
2017-06-23 09:37:00 -0700 /Stage[main]/Packages::7z920/Packages::Pkgmsi[7-Zip 9.20]/File[c:/InstallerSource/puppetagain.pub.build.mozilla.org/MSIs/7z920.msi] (warning): Skipping because of failed dependencies
 
From the beginning of the log:

2017-06-23 09:35:37 -0700 Puppet (info): Applying configuration version '4579255bd9ab'
2017-06-23 09:35:38 -0700 /Stage[users]/Users::Root::Account/User[root]/password (notice): created password
2017-06-23 09:36:27 -0700 /Stage[main]/Python::Misc_python_dir/File[c:\mozilla-build]/mode (notice): mode changed '2000700' to '0755'
2017-06-23 09:36:27 -0700 /Stage[main]/Dirs::Programdata::Puppetagain/File[c:/programdata/puppetagain]/mode (notice): mode changed '2000700' to '0755'
2017-06-23 09:36:35 -0700 /Stage[main]/Hardware::Ec2_config/File[C:/Program Files/Amazon/Ec2ConfigService/Settings/config.xml]/content (notice): 
2017-06-23 09:36:35 -0700 /Stage[main]/Hardware::Ec2_config/File[C:/Program Files/Amazon/Ec2ConfigService/Settings/config.xml]/content (notice): content changed '{md5}f2d59c8656d0461987982e228cbc6788' to '{md5}5daf5c07ffaf4f6f5d911113f4806022'
2017-06-23 09:36:35 -0700 /Stage[main]/Hardware::Ec2_config/File[C:/Program Files/Amazon/Ec2ConfigService/Settings/config.xml]/mode (notice): mode changed '4000777' to '0644'
2017-06-23 09:36:37 -0700 /Stage[main]/Tweaks::Vs_2013_lnk/File[C:/Program Files (x86)/Microsoft Visual Studio 10.0/VC/bin/amd64/cvtres.exe]/mode (notice): mode changed '2000700' to '0644'
2017-06-23 09:36:38 -0700 /Stage[main]/Dirs::Builds/File[C:/builds/]/mode (notice): mode changed '2000700' to '0755'
2017-06-23 09:36:38 -0700 /Stage[main]/Dirs::Builds::Slave/File[C:/builds/moz2_slave]/mode (notice): mode changed '4000750' to '0755'
2017-06-23 09:36:38 -0700 /Stage[main]/Tweaks::Windows_network_opt_netsh/File[C:/programdata/puppetagain/SchTsk_netsh.xml] (err): Could not evaluate: Could not read file C:/programdata/puppetagain/SchTsk_netsh.xml: Permission denied - C:/programdata/puppetagain/SchTsk_netsh.xml
2017-06-23 09:36:39 -0700 /Stage[main]/Runner::Service/File[c:/programdata/puppetagain/start-runner.bat] (err): Could not evaluate: Could not read file c:/programdata/puppetagain/start-runner.bat: Permission denied - c:/programdata/puppetagain/start-runner.bat
2017-06-23 09:36:41 -0700 /Stage[main]/Tweaks::Memory_paging/Registry::Value[DisablePagingExecu

I suspect that something changed that is now causing Puppet to try to apply POSIX-style permissions on Windows. When this happens, for reasons I am not sure of, it causes the directories or files to become read-only. I am going to dive back into the Puppet diffs from the last couple of days.
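Since the end of the log only shows resources skipped because of failed dependencies, the root failures are easier to find by filtering for `(err)` entries. A minimal sketch follows, assuming the log line layout seen in the excerpt above (timestamp, source, level in parentheses, message); `puppet_errors` and `PUPPET_LINE_RE` are hypothetical names, not part of Puppet.

```ruby
# Hypothetical pattern for Puppet agent log lines of the form
#   2017-06-23 09:36:38 -0700 <source> (<level>): <message>
# Lines with a different timestamp layout are simply ignored.
PUPPET_LINE_RE = /\A(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}) (.+?) \((\w+)\): ?(.*)\z/

# Returns one hash per (err) entry, in log order.
def puppet_errors(lines)
  lines.each_with_object([]) do |line, errors|
    m = PUPPET_LINE_RE.match(line.chomp)
    next unless m && m[3] == 'err'

    errors << { time: m[1], source: m[2], message: m[4] }
  end
end
```

Run over the excerpt above, this would keep the two `Permission denied` entries under `c:/programdata/puppetagain` and drop the `notice` and `warning` lines.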
This was removed:

https://dxr.mozilla.org/build-central/source/puppet/modules/shared/lib/puppet/parser/functions/filemode.rb#9

# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.

module Puppet::Parser::Functions
  newfunction(:filemode, :type => :rvalue, :arity => 1) do |args|
    mode = Integer(args[0])

    # on windows, we just don't manage file permissions with mode bits.  It doesn't work.
    if lookupvar("::operatingsystem") == "windows" then
        :undef
    else
        args[0]
    end
  end
end
Comment on attachment 8880954 [details] [diff] [review]
Bug1375852.patch

Reintroduce filemode function when Windows is affected.
Attachment #8880954 - Flags: review?(jwatkins)
Attachment #8880954 - Flags: review?(jwatkins) → review+
(In reply to Mark Cornmesser [:markco] from comment #7)
> This was removed:
> 
> https://dxr.mozilla.org/build-central/source/puppet/modules/shared/lib/puppet/parser/functions/filemode.rb#9
> 
> # This Source Code Form is subject to the terms of the Mozilla Public
> # License, v. 2.0. If a copy of the MPL was not distributed with this
> # file, You can obtain one at http://mozilla.org/MPL/2.0/.
> 
> module Puppet::Parser::Functions
>   newfunction(:filemode, :type => :rvalue, :arity => 1) do |args|
>     mode = Integer(args[0])
> 
>     # on windows, we just don't manage file permissions with mode bits.  It doesn't work.
>     if lookupvar("::operatingsystem") == "windows" then
>         :undef
>     else
>         args[0]
>     end
>   end
> end

The function was not removed. The calls to the function in the manifests were removed.
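To illustrate why the missing calls matter, here is a plain-Ruby sketch of the decision the function makes, runnable outside Puppet. The name `filemode_for` and the explicit `operatingsystem` argument are hypothetical stand-ins for the real function's `lookupvar("::operatingsystem")` lookup.

```ruby
# Plain-Ruby sketch of the filemode decision; not the Puppet function itself.
# filemode_for and its operatingsystem parameter are hypothetical.
def filemode_for(mode, operatingsystem)
  # Mirrors Integer(args[0]) in the original: raises ArgumentError when the
  # mode is not numeric. The converted value is otherwise unused.
  Integer(mode)

  # On Windows, return :undef so the mode is left unmanaged; everywhere
  # else, pass the mode through unchanged.
  operatingsystem == 'windows' ? :undef : mode
end

filemode_for('0755', 'windows') # => :undef
filemode_for('0755', 'CentOS')  # => "0755"
```

With the calls in place, a mode on Windows resolves to undef and Puppet does not manage the permission bits. With the calls removed, the literal mode reaches the File resource, which would be consistent with the `mode changed '2000700' to '0755'` entries in the log above.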
Testing the generation of the y-2008 golden AMI.
Y-2008 golden AMI generation completed. I am going to terminate the other instances and check back in tomorrow morning.
The cron jobs were still running on aws-manager2. I killed those off and kicked off the jobs again.
Today's y-2008 AMI generation is hanging because the instance is unable to shut down.

One weird thing is that we have two "spot-y-2008-2017-06-23-22-35" AMIs in us-west-2, but with different creation dates (June 24 and June 25). I wonder if that was for testing purposes.
This is odd. Last night I was able to spin up a repo on AWS manager 1 and run the scripts. The AMI was generated without error, and the instance terminated. I then jumped on AWS manager 2 and executed the sh file in the y-2008 cron file. It ran through: it created the AMI, terminated the instance, and copied the AMI to USW2. Yet this morning it failed again.
This is what is killing the generation: 

Wed Jun 28 10:45:23 -0700 2017 Puppet (err): Command exceeded timeout
Wed Jun 28 10:45:23 -0700 2017 /Stage[main]/Packages::Mozilla::Py27_mercurial/Exec[Mercurial-3.9.1-x64.exe]/returns (err): change from notrun to 0 failed: Command exceeded timeout

This only occurs when the script is executed through the sh file in the cron file on AWS manager 2.
A manual Puppet run finishes without the Mercurial installation error.
This has occurred again today.
I am going to update the AMI in cloudtools on Wednesday. I have an AMI ready, but I would like to test a few builds based on it before putting it into production.
I think we might be good on this. I updated cloudtools with the new AMI, and I have been able to generate multiple AMIs on AWS manager 2 using the sh script.

The real test will be when the script kicks off at its scheduled time.
The y-2008 AMI was generated this morning.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED