Closed Bug 1362356 Opened 8 years ago Closed 8 years ago

Bad computer name env. variable for Windows AMIs

Categories: Infrastructure & Operations :: RelOps: General
Type: task
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED FIXED

People: Reporter: aselagea; Assigned: markco

Attachments: 5 files, 2 obsolete files

Noticed some issues with the AMI generation process for b/y-2008 and t/g-w732 today due to a wrongly set computer name environment variable. That caused the instances to change the hostname and then reboot, but since that wasn't actually fixing the issue, it resulted in an infinite loop of reboots which were overloading the puppet masters. e.g.

2017-05-05 02:57:46 -07:00 [INFO] Primary DNS suffix set to: test.releng.use1.mozilla.com
2017-05-05 02:57:46 -07:00 [DEBUG] net dns hostname: g-w732-ec2-golden, expected: g-w732-ec2-golden
2017-05-05 02:57:46 -07:00 [DEBUG] computer name env var: G-W732-EC2-GOLD, expected: g-w732-ec2-golden
2017-05-05 02:57:47 -07:00 [INFO] hostname set to: g-w732-ec2-golden
2017-05-05 02:57:47 -07:00 [INFO] shutting down with reason: host renamed
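For context, the "computer name env var" line above is consistent with NetBIOS truncation: Windows caps the legacy COMPUTERNAME environment variable at 15 characters, so it can never exactly equal a 17-character hostname like g-w732-ec2-golden. A hypothetical truncation-aware comparison (illustration only, not the actual Ec2UserdataUtils.psm1 logic):

$expected  = 'g-w732-ec2-golden'
$envName   = $env:COMPUTERNAME                                   # e.g. 'G-W732-EC2-GOLD' (NetBIOS, max 15 chars)
$truncated = $expected.Substring(0, [Math]::Min(15, $expected.Length))
if ($envName -ieq $truncated) {
    'env var only differs by NetBIOS truncation and case; no rename/reboot needed'
} else {
    'env var genuinely differs; rename the host and reboot'
}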
I terminated the instances and re-ran the scripts for those AMIs, but got into the same issues again. @markco: did something change in the AMI generation process recently?
Flags: needinfo?(mcornmesser)
if there are new amis with a creation date right before the problem started happening, i would just delete the new amis (and any created since) in both regions. this would force use of last known good ami.
No, we don't have new AMIs for Windows today. So we're using the AMIs from yesterday (which are fine).
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #1) > I terminated the instances and re-ran the scripts for those AMIs, but got > into the same issues again. > > @markco: did something change in the AMI generation process recently? Nothing that I am aware of. I will take a look into it today.
Flags: needinfo?(mcornmesser)
I know mark was taking a look at this yesterday, but it appears that it hasn't been solved. There were a ton of puppet attempts when the cron jobs kicked off last night. I've killed off those processes on aws-manager2. For the time being, I'm just going to leave the golden instance up so it doesn't try to do the same thing tomorrow (it'll just fail saying that the IP is already allocated).
Assignee: nobody → mcornmesser
Component: Buildduty → RelOps
Product: Release Engineering → Infrastructure & Operations
QA Contact: bugspam.Callek → arich
I take it back, they were still trying to run puppet, so I terminated the instances to stop the puppet spam.
Today we have nagios alerts like:

<relengbot> [sns alert] May 07 01:13:01 aws-manager2.srv.releng.scl3.mozilla.com try-linux64-ec2-golden: lockfile: Sorry, giving up on "/builds/aws_manager/try-linux64-ec2-golden.lock"

Also for bld-linux64-ec2-golden, av-linux64-ec2-golden, tst-linux64-ec2-golden, tst-linux32-ec2-golden, y-2008-ec2-golden, b-2008-ec2-golden, g-w732-ec2-golden, t-w732-ec2-golden, tst-emulator64-ec2-golden. Maybe from earlier interventions, or crond has started multiple jobs. A papertrail search indicates the latter ('CROND tst-emulator64-ec2-golden' turns up two for the 6th and 7th, with the same timestamp; only one earlier). We've seen cron get confused before, so I've restarted crond on aws-manager2.
Then there are alerts like:

01:38 <nagios-releng> Sun 06:38:43 PDT [4023] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 10 crit, 0 warn out of 20 processes with args ec2-golden
03:38 <nagios-releng> Sun 08:38:42 PDT [4027] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 20 crit, 0 warn out of 20 processes with args ec2-golden

It's actually 4 golden AMI generations that have gone wrong - g-w732, t-w732, b-2008, y-2008. There appear to be a couple of errors in the puppet mail:

1. chocolatey might have blacklisted us

2017-05-07T23:33:38.703Z: Ec2HandleUserData: Message: The errors from user scripts:
Exception calling "DownloadString" with "1" argument(s): "The remote server returned an error: (403) Forbidden."
At C:\windows\system32\WindowsPowerShell\v1.0\Modules\Ec2UserdataUtils\Ec2UserdataUtils.psm1:1746 char:70
+ Invoke-Expression ((New-Object System.Net.WebClient).DownloadString <<<< ('https://chocolatey.org/install.ps1'))
    + CategoryInfo          : NotSpecified: (:) [], MethodInvocationException
    + FullyQualifiedErrorId : DotNetMethodException

And if I do this from buildbot-master82:

$ wget https://chocolatey.org/install.ps1
--2017-05-07 17:08:15-- https://chocolatey.org/install.ps1
Resolving chocolatey.org... 104.20.73.28, 104.20.74.28, 2400:cb00:2048:1::6814:4a1c, ...
Connecting to chocolatey.org|104.20.73.28|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2017-05-07 17:08:15 ERROR 403: Forbidden.

Works fine locally, so I suspect they've blocked the SCL3 NAT (63.245.214.82). I found this tweet: https://twitter.com/ferventcoder/status/860245736973316098

"If you just started seeing 403 errors with the #chocolatey community repository today, please reach out to us at https://gitter.im/chocolatey/choco"

Does that match up? I've commented on that gitter.

2. sublimetext3 isn't installing

Userdata: Install-Package :: sublimetext3.packagecontrol install failed with exit code: 1

The last time we successfully built windows AMI are 2015-05-04.
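(For completeness, the equivalent test from PowerShell on one of the Windows instances would look roughly like the snippet below; this is illustrative only, not part of the userdata code.)

try {
    (New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1') | Out-Null
    'download ok'
} catch {
    # expect "The remote server returned an error: (403) Forbidden." while the NAT IP is blocked
    $_.Exception.Message
}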
I mentioned in irc that I killed off the instances (to stop the puppet email) but left the processes running on aws-manager2 (to stop them from kicking off again tonight) to give markco a chance to look at them tomorrow.
(In reply to Nick Thomas [:nthomas] from comment #8)
> The last time we successfully built windows AMI are 2015-05-04.

Correction - 2017-05-04.

I spoke with Rob Reynolds of Chocolatey (via Gitter chat) and he confirmed that they've started blocking our requests. Their server stats show about 4 million package installs from our IP over the last 30 days, the vast majority of them of chocolatey itself. His suggestions are to
* check for this chef bug - https://github.com/chocolatey/chocolatey-cookbook/issues/105
* install chocolatey from an internal location

This can only be our buildbot infra given it's from the single NAT IP, which I think means t/g-w732 and b/y-2008 (not used on the ix machines). It's almost like we re-install every time we start a spot instance, although that would still be an average of 130k instance starts a day, which seems high.

AIUI the install happens in Install-BasePrerequisites
https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/Ec2UserdataUtils.psm1#L1740
called by b-2008.user-data (which is symlinked to the other worker pools mentioned above):
https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/b-2008.user-data#L110

Perhaps we just move Install-BasePrerequisites into Prep-Golden. Over to Mark now for proper debugging. Rob R. said to let them know when they can drop the block.
(In reply to Nick Thomas [:nthomas] from comment #7)
> Today we have nagios alerts like:
> <relengbot> [sns alert] May 07 01:13:01 aws-manager2.srv.releng.scl3.mozilla.com try-linux64-ec2-golden: lockfile: Sorry, giving up on "/builds/aws_manager/try-linux64-ec2-golden.lock"
>
> Also for bld-linux64-ec2-golden, av-linux64-ec2-golden, tst-linux64-ec2-golden, tst-linux32-ec2-golden, y-2008-ec2-golden, b-2008-ec2-golden, g-w732-ec2-golden, t-w732-ec2-golden, tst-emulator64-ec2-golden. Maybe from earlier interventions, or crond has started multiple jobs. A papertrail search indicates the latter ('CROND tst-emulator64-ec2-golden' turns up two for the 6th and 7th, with the same timestamp; only one earlier). We've seen cron get confused before so I've restarted crond on aws-manager2.

The same alerts were spotted again today in #buildduty. There were two crond instances running on aws-manager:

root      5852     1  0 Feb23 ?        00:01:40 crond
root      6325     1  0 May07 ?        00:00:00 crond

For some reason, running "/etc/init.d/crond restart" will not stop the older process, so it will result in two different processes. I killed both of them, then started crond again. Now we only have one process running.
Assigning this to rob to fix the chocolatey issues. I'm not sure that's the underlying problem, but we need to fix that, regardless.
Assignee: mcornmesser → rthijssen
install chocolatey from internal infra: https://github.com/mozilla-releng/build-cloud-tools/commit/940ca418fd9c33f6ed91837576f5f47ac8c4cb3b still need to alter the script at http://releng-puppet2.srv.releng.scl3.mozilla.com/repos/EXEs/chocolatey/install.ps1 to make sure it also downloads any artifacts it needs from infra.
install chocolatey from internal infra: https://github.com/mozilla-releng/build-cloud-tools/commit/fcaa07a06044ed3ad40601bf86b78207efb0650e this should resolve the high traffic to chocolatey.org
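roughly, this points the existing install step at our own mirror instead of chocolatey.org. a hedged sketch of the idea (the exact change is in the commits above; the url is the internal one mentioned in the previous comment):

# download the chocolatey installer from the internal mirror rather than chocolatey.org
$internalInstaller = 'http://releng-puppet2.srv.releng.scl3.mozilla.com/repos/EXEs/chocolatey/install.ps1'
Invoke-Expression ((New-Object System.Net.WebClient).DownloadString($internalInstaller))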
Landed https://github.com/mozilla-releng/build-cloud-tools/commit/b142bfb0582ed1887a37e5e42d14ec7e1c34331b to try to resolve the tree closure in bug 1363204 (only run Install-BasePrerequisites on the golden ami).
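The gist of that change is to guard the expensive install so only the golden instance runs it. A hypothetical sketch of such a guard (the hostname check here is made up for illustration; the real detection lives in the userdata/psm1 code):

Import-Module Ec2UserdataUtils              # assumed to provide Install-BasePrerequisites
$hostname = [System.Net.Dns]::GetHostName()
if ($hostname -match 'ec2-golden') {
    Install-BasePrerequisites               # only the golden AMI builder installs prerequisites
} else {
    'spot instance: prerequisites are already baked into the AMI, skipping install'
}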
That worked for new instances, but there might still be a netflow issue for the golden AMIs.
As far as I can tell there have been no changes to the github cloud-tools repo or to Puppet that would have affected Windows golden AMI creation. grenade: I am kind of at a loss here. After hardcoding the URL for chocolatey and commenting out the reboot after being renamed, I am able to run the userdata script, reboot manually, rerun the script, and it will work through to puppetizing. On the second run the computername variable is also correct. This seems to only happen when the EC2 service is executing the scripts. Any thoughts? Any ideas on where else to look?
Flags: needinfo?(rthijssen)
presumably the new problem relates to my use of the $domain variable in the url to the puppet server. it's probably worth trying a url of: http://puppet/repos/EXEs/chocolatey/... instead of attempting to build an fqdn as i did in my patches. the hostname "puppet" is *supposed to* resolve to the nearest puppet server (courtesy of our dns settings), so as long as that still works, it may fix the issue. there's probably something screwy in my attempts to build an fqdn to the puppet server for the correct region.
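in other words, something like this (illustrative only, assuming the dns search suffixes still resolve the shortname):

# check that the 'puppet' shortname resolves from the instance
[System.Net.Dns]::GetHostEntry('puppet').HostName                                 # should print the nearest puppet server's fqdn

# then the installer url no longer needs an fqdn built from $domain
$chocoInstaller = 'http://puppet/repos/EXEs/chocolatey/install.ps1'
(New-Object System.Net.WebClient).DownloadString($chocoInstaller) | Out-Null      # throws if unreachable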
Flags: needinfo?(rthijssen)
We should be good after this lands. In testing I was able to spin up and capture a y-2008 golden AMI. The instances it spawned successfully finished builds.
Attachment #8866583 - Flags: review?(rthijssen)
I let upstream know we fixed the issue with installing chocolatey on every boot.
We also noticed that there's a tooltool download failing, and the tokens that should be on disk are missing on those instances that are failing the download. This + the ssh issues feel like some instances are instantiating as loaners and having their secrets wiped: https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/Ec2UserdataUtils.psm1#L775 I think the hostname is now matching the loaner string because it's no longer correctly reporting as golden: https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/b-2008.user-data#L126
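To illustrate the suspicion (both patterns below are made up for the example, not the actual ones in Ec2UserdataUtils.psm1): a loose loaner regex can match an ordinary spot/golden hostname and trigger the secret purge, while an anchored one would not.

$hostname = 'b-2008-ec2-golden'
$hostname -match '.*-ec2-.*'           # True  -> loaner path taken, secrets wiped
$hostname -match '\.loan\.releng\.'    # False -> instance left intact (assumes loaner hostnames carry a .loan. suffix)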
None of the ec2data reports sent to the puppet alias contain the log lines that would indicate that secrets were purged. Those lines haven't been seen in a report since loaners were set up in January.
Comment on attachment 8867912 [details] [review] Replace portion of the Install-RelOpsPrerequisites function I think this one will be superseded.
Attachment #8867912 - Flags: review?(nthomas)
For testing/debugging purposes, I've created a different AWS config and am using /builds/aws_manager/bin/aws_manager-b-2008-arr-ec2-golden.sh to generate an AMI. It uses /builds/aws_manager/cloud-tools/configs/b-2008-arr and /builds/aws_manager/cloud-tools/configs/b-2008-arr.user-data which reads its psm1 config from my fork of the build-cloud-tools repo instead of the live production repo (so I can make changes without affecting prod). Once testing is done, those three files should be cleaned up.
I've reverted b9c1aed49990b4a7e7ad28a90e030219d2634f5f to see if that helps until markco is awake.
Blocks: 1365219
I'm going to temporarily turn on mail for EVERY run, including spot. This is going to generate a flood of email, but it's our best hope of seeing what's happening.
Attachment #8868115 - Flags: review?(aselagea)
Comment on attachment 8868115 [details] [diff] [review] enable-spot-userdata-email.patch Looks good!
Attachment #8868115 - Flags: review?(aselagea) → review+
We launched a lot of instances and got data from them, so I've reverted comment 30.
I've retagged the t-w732, g-w732, b-2008, and y-2008 AMIs created after May 3rd as <host>-1362356. This should roll back the AMIs for new instances created today. I don't think that's going to help, since I think the issue is with UserData, but it returns us to a time when we weren't failing to generate the golden AMIs and before we made changes to accommodate that.
Correction, the move to c4.4xlarge was merged on the 2nd, so I rolled back to the AMIs for the 2nd.
Attached patch bug1362356.patch (obsolete) — Splinter Review
Attachment #8868230 - Flags: review?(arich)
Attachment #8868230 - Flags: review?(arich)
Attached patch bug1362356-2.patch (obsolete) — Splinter Review
Attachment #8868230 - Attachment is obsolete: true
Attachment #8868233 - Flags: review?(arich)
Attachment #8868233 - Flags: review?(arich)
Attachment #8868233 - Attachment is obsolete: true
Attachment #8868277 - Flags: review?(arich)
Attachment #8868277 - Flags: review?(arich) → review+
Attached patch e2c-debacle.diff — Splinter Review
There was a lot of debugging today, and Q, markco, and I found and corrected a lot of logic errors and issues. Instead of detailing them all with individual patches, I'm going to include this one large patch that shows the diff between when things were last working and now.

We think there were a number of contributing factors to the issues we've seen:

* First, we think something changed with the way that AWS is reporting env variables, because a bunch of things broke that are based on env vars.
* There was a broken regex for matching loaner instances. We think that might have also been hitting b-2008 because of the hostname/env var issues. This would have partially run the loaner function and deleted ssh keys.
* There was a race condition with mounting the ephemeral disk and copying things back to it. The host would sometimes reboot in the middle of that process, so we were missing buildapi tokens (and other files). We also put in a semaphore (sketched below) to make sure that runner would not start until the files were all copied over, just in case we were hitting another race condition where runner would pick a job without the token.
* We also found that puppet was running on every reboot. That's been disabled except on the golden instance.
* We were installing spurious packages like chocolatey (which we had added when we thought the future was going to be buildbot+AWS+puppet). We wound up DDOSing the chocolatey folks because that was ALSO running and installing chocolatey from their infrastructure on every reboot.
* Rob's original fix for this ripped out a bit too much and we lost some functions we actually needed (nxlog config, windows error reporting, stopping puppet, etc). Those were added back in.
* Install-MozillaBuildAndPrerequisites was removed because all of this is handled by puppet. Same with Enable-CloneBundle.

In addition to this, we still need to fix the golden AMI generation. It's failing due to the regex match failure we originally saw when it tried to generate the AMI on the 5th. markco is working on a patch for that now.
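For the semaphore mentioned above, the idea is roughly the following; paths and the runner entry point are placeholders, not the values from the actual patch:

# userdata side: drop a flag file only after the ephemeral-disk copy has finished
$semaphore = 'C:\builds\userdata-complete.sem'             # placeholder path
# ... mount the ephemeral disk and copy buildapi tokens and other secrets here ...
Set-Content -Path $semaphore -Value (Get-Date -Format o)

# runner side: refuse to start picking jobs until the flag file exists
while (-not (Test-Path $semaphore)) {
    Start-Sleep -Seconds 10
}
Start-Process 'C:\opt\runner\runner.cmd'                   # placeholder for the real runner start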
One of the other potential issues was the switch from c3.4xlarge to c4.4xlarge which got reverted. At the very least, it slowed builds down considerably and may have exacerbated some of the timeouts and race conditions we saw.
This has been running for a couple days and the issues seem to be cleared up.
Assignee: rthijssen → mcornmesser
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED