Closed
Bug 1375852
Opened 7 years ago
Closed 7 years ago
Cannot create Windows golden AMIs
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: aselagea, Assigned: markco)
References
Details
Attachments
(1 file)
8.05 KB, patch
arich: review+
markco: checked-in+
From #buildduty:

Fri 13:41:19 UTC [7048] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 20 crit, 0 warn out of 20 processes with args 'ec2-golden' (http://m.mozilla.org/procs+age+-+golden+AMI)

Taking a look on aws-manager, I noticed that all 20 processes were for Windows AMIs. Papertrail logs seem to show lots of unexpected shutdowns.

Jun 23 02:37:42 b-2008-ec2-golden.build.releng.use1.mozilla.com EventLog: The previous system shutdown at 3:40:24 AM on 6/21/2017 was unexpected.#015
Jun 23 02:37:43 b-2008-ec2-golden.build.releng.use1.mozilla.com Microsoft-Windows-Kernel-Power: The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.#015
Jun 23 02:37:43 b-2008-ec2-golden.build.releng.use1.mozilla.com Microsoft-Windows-Eventlog: Audit events have been dropped by the transport. 0#015
Jun 23 02:37:44 b-2008-ec2-golden.build.releng.use1.mozilla.com Xennet6: 2017-06-23 02:36:02 b-2008-ec2-golden.build.releng.use1.mozilla.com ERROR 5001 [The description for EventID 5001 from source Xennet6 cannot be found: The parameter is incorrect. ]#015
Jun 23 02:37:44 b-2008-ec2-golden.build.releng.use1.mozilla.com EventLog: The previous system shutdown at 3:40:24 AM on 6/21/2017 was unexpected.#015
Jun 23 02:37:44 b-2008-ec2-golden.build.releng.use1.mozilla.com Microsoft-Windows-Kernel-Power: The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.#015
Jun 23 02:37:45 b-2008-ec2-golden.build.releng.use1.mozilla.com Microsoft-Windows-Eventlog: Audit events have been dropped by the transport. 0#015
Jun 23 02:38:06 b-2008-ec2-golden.build.releng.use1.mozilla.com Xennet6: 2017-06-23 02:36:01 b-2008-ec2-golden.build.releng.use1.mozilla.com ERROR 5001 [The description for EventID 5001 from source Xennet6 cannot be found: The parameter is incorrect. ]#015
Jun 23 02:38:06 b-2008-ec2-golden.build.releng.use1.mozilla.com EventLog: The previous system shutdown at 3:40:24 AM on 6/21/2017 was unexpected.#015
Jun 23 02:38:06 b-2008-ec2-golden.build.releng.use1.mozilla.com Microsoft-Windows-Kernel-Power: The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.#015
Jun 23 02:38:06 b-2008-ec2-golden.build.releng.use1.mozilla.com Microsoft-Windows-Eventlog: Audit events have been dropped by the transport. 0#015
Jun 23 02:38:47 b-2008-ec2-golden.build.releng.use1.mozilla.com USER32: The process C:\Windows\system32\shutdown.exe (B-2008-EC2-GOLD) has initiated the restart of computer B-2008-EC2-GOLD on behalf of user NT AUTHORITY\SYSTEM for the following reason: Application: Maintenance (Planned) Reason Code: 0x80040001 Shutdown Type: restart Comment: host renamed#015
Jun 23 02:38:48 b-2008-ec2-golden.build.releng.use1.mozilla.com USER32: The process C:\Windows\system32\shutdown.exe (B-2008-EC2-GOLD) has initiated the restart of computer B-2008-EC2-GOLD on behalf of user NT AUTHORITY\SYSTEM for the following reason: Application: Maintenance (Planned) Reason Code: 0x80040001 Shutdown Type: restart Comment: host renamed#015
Reporter | ||
Comment 1•7 years ago
|
||
@Dragos: do you think this might be related to your recent puppet changes?
Flags: needinfo?(dcrisan)
Comment 2•7 years ago
|
||
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #1)
> @Dragos: do you think this might be related to your recent puppet changes?

From the logs, these unexpected shutdowns do not appear to be related to the Puppet changes.
Updated•7 years ago
|
Flags: needinfo?(dcrisan)
Assignee | ||
Comment 3•7 years ago
|
||
I am going to terminate y-2008-ec2-gold, restart the process and watch the logs.
Assignee: nobody → mcornmesser
Component: Buildduty → RelOps
Product: Release Engineering → Infrastructure & Operations
QA Contact: catlee → arich
Version: unspecified → other
Assignee | ||
Comment 4•7 years ago
|
||
These messages do not reflect a shutdown near the time they were logged. The message below was reported about an hour after the last reboot, yet it refers to a shutdown on 6/21:

Jun 23 10:42:34 y-2008-ec2-golden.try.releng.use1.mozilla.com EventLog: The previous system shutdown at 3:44:00 AM on 6/21/2017 was unexpected.#015
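To make the mismatch concrete, here is a minimal Ruby sketch that compares the syslog timestamp of that line with the shutdown time quoted inside the message. It is only an illustration: the helper names are hypothetical, the year 2017 is assumed for the syslog stamp (which carries none), and time zones are ignored because the log does not state them.

```ruby
require 'date'

# Values copied from the log line in this comment.
SYSLOG_STAMP = 'Jun 23 10:42:34'
EVENT_TEXT = 'EventLog: The previous system shutdown at 3:44:00 AM on 6/21/2017 was unexpected.'

# Extract the shutdown time that the EventLog text itself reports.
def reported_shutdown(text)
  m = text.match(%r{shutdown at (\d{1,2}:\d{2}:\d{2} [AP]M) on (\d{1,2}/\d{1,2}/\d{4})})
  return nil unless m

  DateTime.strptime("#{m[2]} #{m[1]}", '%m/%d/%Y %I:%M:%S %p')
end

# Syslog stamps carry no year, so the caller supplies one (assumed: 2017).
def logged_at(stamp, year)
  DateTime.strptime("#{year} #{stamp}", '%Y %b %d %H:%M:%S')
end

# Hours between the reported shutdown and the moment the message was logged.
def hours_between(stamp, year, text)
  ((logged_at(stamp, year) - reported_shutdown(text)) * 24).to_f
end

puts format('Logged %.1f hours after the shutdown it describes',
            hours_between(SYSLOG_STAMP, 2017, EVENT_TEXT))
```

A gap of more than two days suggests the event is being replayed on each boot rather than describing a fresh crash, which is consistent with the observation above.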
Assignee | ||
Comment 5•7 years ago
|
||
Currently the instance is not getting to the Puppet run. The last Puppet run reported was from the previous day:
https://foreman.pub.build.mozilla.org/reports/26842396
Assignee | ||
Comment 6•7 years ago
|
||
I think it is getting to the Puppet run, but the run is not completing successfully and is not reporting back to Foreman.

From the end of the log on the local machine:

2017-06-23 09:37:00 -0700 /Stage[main]/Packages::Mozilla::Git/File[C:/installersource/puppetagain.pub.build.mozilla.org/EXEs/Git-2.7.4-32-bit.exe] (notice): Dependency File[C:/installersource/puppetagain.pub.build.mozilla.org] has failures: true
2017-06-23 09:37:00 -0700 /Stage[main]/Packages::Mozilla::Git/File[C:/installersource/puppetagain.pub.build.mozilla.org/EXEs/Git-2.7.4-32-bit.exe] (warning): Skipping because of failed dependencies
2017-06-23 09:37:00 -0700 /Stage[main]/Packages::Mozilla::Git/Exec[Git-2.7.4-32-bit.exe] (notice): Dependency File[C:/installersource/puppetagain.pub.build.mozilla.org] has failures: true
2017-06-23 09:37:00 -0700 /Stage[main]/Packages::Mozilla::Git/Exec[Git-2.7.4-32-bit.exe] (warning): Skipping because of failed dependencies
2017-06-23 09:37:00 -0700 /Stage[main]/Packages::Ultravnc/Packages::Pkgmsi[UltraVnc]/File[c:/InstallerSource/puppetagain.pub.build.mozilla.org/MSIs/UltraVnc_10962_x64.msi] (notice): Dependency File[C:/installersource/puppetagain.pub.build.mozilla.org] has failures: true
2017-06-23 09:37:00 -0700 /Stage[main]/Packages::Ultravnc/Packages::Pkgmsi[UltraVnc]/File[c:/InstallerSource/puppetagain.pub.build.mozilla.org/MSIs/UltraVnc_10962_x64.msi] (warning): Skipping because of failed dependencies
2017-06-23 09:37:00 -0700 /Stage[main]/Packages::7z920/Packages::Pkgmsi[7-Zip 9.20]/File[c:/InstallerSource/puppetagain.pub.build.mozilla.org/MSIs/7z920.msi] (notice): Dependency File[C:/installersource/puppetagain.pub.build.mozilla.org] has failures: true
2017-06-23 09:37:00 -0700 /Stage[main]/Packages::7z920/Packages::Pkgmsi[7-Zip 9.20]/File[c:/InstallerSource/puppetagain.pub.build.mozilla.org/MSIs/7z920.msi] (warning): Skipping because of failed dependencies

From the beginning of the log:

2017-06-23 09:35:37 -0700 Puppet (info): Applying configuration version '4579255bd9ab'
2017-06-23 09:35:38 -0700 /Stage[users]/Users::Root::Account/User[root]/password (notice): created password
2017-06-23 09:36:27 -0700 /Stage[main]/Python::Misc_python_dir/File[c:\mozilla-build]/mode (notice): mode changed '2000700' to '0755'
2017-06-23 09:36:27 -0700 /Stage[main]/Dirs::Programdata::Puppetagain/File[c:/programdata/puppetagain]/mode (notice): mode changed '2000700' to '0755'
2017-06-23 09:36:35 -0700 /Stage[main]/Hardware::Ec2_config/File[C:/Program Files/Amazon/Ec2ConfigService/Settings/config.xml]/content (notice):
2017-06-23 09:36:35 -0700 /Stage[main]/Hardware::Ec2_config/File[C:/Program Files/Amazon/Ec2ConfigService/Settings/config.xml]/content (notice): content changed '{md5}f2d59c8656d0461987982e228cbc6788' to '{md5}5daf5c07ffaf4f6f5d911113f4806022'
2017-06-23 09:36:35 -0700 /Stage[main]/Hardware::Ec2_config/File[C:/Program Files/Amazon/Ec2ConfigService/Settings/config.xml]/mode (notice): mode changed '4000777' to '0644'
2017-06-23 09:36:37 -0700 /Stage[main]/Tweaks::Vs_2013_lnk/File[C:/Program Files (x86)/Microsoft Visual Studio 10.0/VC/bin/amd64/cvtres.exe]/mode (notice): mode changed '2000700' to '0644'
2017-06-23 09:36:38 -0700 /Stage[main]/Dirs::Builds/File[C:/builds/]/mode (notice): mode changed '2000700' to '0755'
2017-06-23 09:36:38 -0700 /Stage[main]/Dirs::Builds::Slave/File[C:/builds/moz2_slave]/mode (notice): mode changed '4000750' to '0755'
2017-06-23 09:36:38 -0700 /Stage[main]/Tweaks::Windows_network_opt_netsh/File[C:/programdata/puppetagain/SchTsk_netsh.xml] (err): Could not evaluate: Could not read file C:/programdata/puppetagain/SchTsk_netsh.xml: Permission denied - C:/programdata/puppetagain/SchTsk_netsh.xml
2017-06-23 09:36:39 -0700 /Stage[main]/Runner::Service/File[c:/programdata/puppetagain/start-runner.bat] (err): Could not evaluate: Could not read file c:/programdata/puppetagain/start-runner.bat: Permission denied - c:/programdata/puppetagain/start-runner.bat
2017-06-23 09:36:41 -0700 /Stage[main]/Tweaks::Memory_paging/Registry::Value[DisablePagingExecu

I suspect that a recent change is now causing Puppet to try to apply POSIX-style permissions on Windows. When this happens, for reasons I am not sure of, it causes the affected directories or files to become read-only. I am going to dive back into the Puppet diffs from the last couple of days.
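For anyone triaging a similar log, the "Skipping because of failed dependencies" warnings are downstream noise; the "(err)" entries are where the run first went wrong. Below is a minimal Ruby sketch that pulls out only those entries. The regular expression is an assumption based on the line format visible in this comment, not a documented Puppet log grammar, and the constant and method names are hypothetical.

```ruby
# Assumed line shape, inferred from the log excerpt in this comment:
#   <date> <time> <utc offset> <resource path> (<level>): <message>
PUPPET_LOG_LINE = /\A(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}) (.+?) \((\w+)\): (.*)\z/

# Three lines copied verbatim from the log above.
SAMPLE_LINES = [
  "2017-06-23 09:36:38 -0700 /Stage[main]/Dirs::Builds/File[C:/builds/]/mode (notice): mode changed '2000700' to '0755'",
  "2017-06-23 09:36:38 -0700 /Stage[main]/Tweaks::Windows_network_opt_netsh/File[C:/programdata/puppetagain/SchTsk_netsh.xml] (err): Could not evaluate: Could not read file C:/programdata/puppetagain/SchTsk_netsh.xml: Permission denied - C:/programdata/puppetagain/SchTsk_netsh.xml",
  "2017-06-23 09:37:00 -0700 /Stage[main]/Packages::Mozilla::Git/Exec[Git-2.7.4-32-bit.exe] (warning): Skipping because of failed dependencies"
].freeze

# Return only the entries logged at level "err", as hashes.
def root_errors(lines)
  lines.map do |line|
    m = PUPPET_LOG_LINE.match(line.strip)
    next unless m && m[3] == 'err'

    { time: m[1], resource: m[2], message: m[4] }
  end.compact
end

root_errors(SAMPLE_LINES).each do |entry|
  puts "#{entry[:time]} #{entry[:resource]}"
  puts "  #{entry[:message]}"
end
```

Run against the full log, this would surface the two "Permission denied" failures under c:/programdata/puppetagain, which follow the mode changes on that directory.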
Assignee | ||
Comment 7•7 years ago
|
||
This was removed:

https://dxr.mozilla.org/build-central/source/puppet/modules/shared/lib/puppet/parser/functions/filemode.rb#9

# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.

module Puppet::Parser::Functions
  newfunction(:filemode, :type => :rvalue, :arity => 1) do |args|
    mode = Integer(args[0])

    # on windows, we just don't manage file permissions with mode bits. It doesn't work.
    if lookupvar("::operatingsystem") == "windows" then
      :undef
    else
      args[0]
    end
  end
end
Assignee | ||
Comment 8•7 years ago
|
||
Assignee | ||
Comment 9•7 years ago
|
||
Comment on attachment 8880954
Bug1375852.patch

Reintroduce the filemode function where Windows is affected.
Attachment #8880954 -
Flags: review?(jwatkins)
Updated•7 years ago
|
Attachment #8880954 -
Flags: review?(jwatkins) → review+
Assignee | ||
Comment 10•7 years ago
|
||
Comment on attachment 8880954
Bug1375852.patch

https://hg.mozilla.org/build/puppet/rev/3f46f4bd3db5983231228e96706718f66a271426
Attachment #8880954 -
Flags: checked-in+
Assignee | ||
Comment 11•7 years ago
|
||
(In reply to Mark Cornmesser [:markco] from comment #7)
> This was removed:
>
> https://dxr.mozilla.org/build-central/source/puppet/modules/shared/lib/puppet/parser/functions/filemode.rb#9
>
> # This Source Code Form is subject to the terms of the Mozilla Public
> # License, v. 2.0. If a copy of the MPL was not distributed with this
> # file, You can obtain one at http://mozilla.org/MPL/2.0/.
>
> module Puppet::Parser::Functions
>   newfunction(:filemode, :type => :rvalue, :arity => 1) do |args|
>     mode = Integer(args[0])
>
>     # on windows, we just don't manage file permissions with mode bits. It doesn't work.
>     if lookupvar("::operatingsystem") == "windows" then
>       :undef
>     else
>       args[0]
>     end
>   end
> end

Correction: the function itself was not removed. The calls to the function in the manifests were removed.
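To show why dropping those calls matters, here is a plain-Ruby stand-in for the quoted function that runs without Puppet. It is a sketch under two assumptions: the operating system is passed in as an argument instead of being read with lookupvar("::operatingsystem"), and nil stands in for Puppet's :undef. The name filemode_sketch is hypothetical.

```ruby
# Stand-in for the filemode parser function quoted above.
# Assumptions: `operatingsystem` replaces the fact lookup, nil replaces :undef.
def filemode_sketch(operatingsystem, mode)
  # Same validation step as the original: raises ArgumentError for
  # input that is not a number.
  Integer(mode)

  # On Windows no mode is returned, so the file resource gets no mode bits.
  return nil if operatingsystem == 'windows'

  mode
end

p filemode_sketch('windows', '0755')
p filemode_sketch('CentOS', '0755')
```

With the call in place, a manifest gets no mode on Windows. With a literal mode written directly in the manifest, Puppet applies it on Windows as well, which matches the "mode changed" notices and the later "Permission denied" errors in comment 6.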
Assignee | ||
Comment 12•7 years ago
|
||
Testing the generation of the y-2008 golden AMI.
Assignee | ||
Comment 13•7 years ago
|
||
The y-2008 golden AMI generation completed. I am going to terminate the other instances and check back in tomorrow morning.
Comment 14•7 years ago
|
||
The cron jobs were still running on aws-manager2. I killed those off and kicked off the jobs again.
Reporter | ||
Comment 15•7 years ago
|
||
Today's y-2008 AMI generation is hanging because the instance is unable to shut down. One odd thing is that we have two "spot-y-2008-2017-06-23-22-35" AMIs in us-west-2 with different creation dates (June 24 and June 25). I wonder if one of them was created for testing purposes.
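A quick way to spot this kind of duplicate is to group AMI records by name. The Ruby sketch below uses a hypothetical in-memory list rather than a real AWS call; the name and the two creation dates come from this comment, and the year 2017 is an assumption.

```ruby
require 'date'

# Hypothetical records; only the name and the two dates come from this comment.
KNOWN_AMIS = [
  { name: 'spot-y-2008-2017-06-23-22-35', created: Date.new(2017, 6, 24) },
  { name: 'spot-y-2008-2017-06-23-22-35', created: Date.new(2017, 6, 25) }
].freeze

# Map each AMI name that appears more than once to its sorted creation dates.
def duplicate_names(amis)
  amis.group_by { |ami| ami[:name] }
      .select { |_name, group| group.size > 1 }
      .transform_values { |group| group.map { |ami| ami[:created] }.sort }
end

duplicate_names(KNOWN_AMIS).each do |name, dates|
  puts "#{name}: #{dates.map(&:to_s).join(', ')}"
end
```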
Assignee | ||
Comment 16•7 years ago
|
||
This is odd. Last night I was able to spin up a repo on AWS manager 1 and run the scripts. The AMI was generated without error, and the instance terminated. I then jumped on AWS manager 2 and executed the sh file in the y-2008 cron file; it ran through, created the AMI, terminated the instance, and copied the AMI to USW2. Yet this morning it failed again.
Assignee | ||
Comment 17•7 years ago
|
||
This is what is killing the generation:

Wed Jun 28 10:45:23 -0700 2017 Puppet (err): Command exceeded timeout
Wed Jun 28 10:45:23 -0700 2017 /Stage[main]/Packages::Mozilla::Py27_mercurial/Exec[Mercurial-3.9.1-x64.exe]/returns (err): change from notrun to 0 failed: Command exceeded timeout

This only occurs when the script is executed from the sh file in the cron file on AWS manager 2.
Assignee | ||
Comment 18•7 years ago
|
||
A manual Puppet run finishes without the Mercurial installation error.
Reporter | ||
Comment 20•7 years ago
|
||
This has occurred again today.
Assignee | ||
Comment 21•7 years ago
|
||
I am going to update the AMI in cloudtools on Wednesday. I have an AMI ready, but I would like to test a few builds based on it before putting it into production.
Assignee | ||
Comment 22•7 years ago
|
||
I think we might be good on this. I updated cloudtools with the new AMI, and I have been able to generate multiple AMIs on AWS manager 2 using the sh script. The real test will be when the script kicks off at its scheduled time.
Assignee | ||
Comment 23•7 years ago
|
||
The y-2008 AMI was generated successfully this morning.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED