Closed Bug 1362356 Opened 8 years ago Closed 8 years ago

Bad computer name env. variable for Windows AMIs

Categories: Infrastructure & Operations :: RelOps: General
Type: task
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED FIXED

People: Reporter: aselagea; Assigned: markco

Attachments: 5 files, 2 obsolete files

Noticed some issues with the AMI generation process for b/y-2008 and t/g-w732 today due to a wrongly set computer name environment variable. That caused the instances to change the hostname and then reboot, but since that wasn't actually fixing the issue, it resulted in an infinite loop of reboots which were overloading the puppet masters. e.g.

2017-05-05 02:57:46 -07:00 [INFO] Primary DNS suffix set to: test.releng.use1.mozilla.com
2017-05-05 02:57:46 -07:00 [DEBUG] net dns hostname: g-w732-ec2-golden, expected: g-w732-ec2-golden
2017-05-05 02:57:46 -07:00 [DEBUG] computer name env var: G-W732-EC2-GOLD, expected: g-w732-ec2-golden
2017-05-05 02:57:47 -07:00 [INFO] hostname set to: g-w732-ec2-golden
2017-05-05 02:57:47 -07:00 [INFO] shutting down with reason: host renamed
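For context, the "computer name env var" line above is consistent with NetBIOS truncation: Windows caps the legacy COMPUTERNAME environment variable at 15 characters, so it can never exactly equal a 17-character hostname like g-w732-ec2-golden. A hypothetical truncation-aware comparison (illustration only, not the actual Ec2UserdataUtils.psm1 logic):

$expected  = 'g-w732-ec2-golden'
$envName   = $env:COMPUTERNAME                                   # e.g. 'G-W732-EC2-GOLD' (NetBIOS, max 15 chars)
$truncated = $expected.Substring(0, [Math]::Min(15, $expected.Length))
if ($envName -ieq $truncated) {
    'env var only differs by NetBIOS truncation and case; no rename/reboot needed'
} else {
    'env var genuinely differs; rename the host and reboot'
}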
I terminated the instances and re-ran the scripts for those AMIs, but got into the same issues again. @markco: did something change in the AMI generation process recently?
Flags: needinfo?(mcornmesser)
if there are new amis with a creation date right before the problem started happening, i would just delete the new amis (and any created since) in both regions. this would force use of last known good ami.
No, we don't have new AMIs for Windows today. So we're using the AMIs from yesterday (which are fine).
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #1) > I terminated the instances and re-ran the scripts for those AMIs, but got > into the same issues again. > > @markco: did something change in the AMI generation process recently? Nothing that I am aware of. I will take a look into it today.
Flags: needinfo?(mcornmesser)
I know mark was taking a look at this yesterday, but it appears that it hasn't been solved. There were a ton of puppet attempts when the cron jobs kicked off last night. I've killed off those processes on aws-manager2. For the time being, I'm just going to leave the golden instance up so it doesn't try to do the same thing tomorrow (it'll just fail saying that the IP is already allocated).
Assignee: nobody → mcornmesser
Component: Buildduty → RelOps
Product: Release Engineering → Infrastructure & Operations
QA Contact: bugspam.Callek → arich
I take it back, they were still trying to run puppet, so I terminated the instances to stop the puppet spam.
Today we have nagios alerts like:

<relengbot> [sns alert] May 07 01:13:01 aws-manager2.srv.releng.scl3.mozilla.com try-linux64-ec2-golden: lockfile: Sorry, giving up on "/builds/aws_manager/try-linux64-ec2-golden.lock"

Also for bld-linux64-ec2-golden, av-linux64-ec2-golden, tst-linux64-ec2-golden, tst-linux32-ec2-golden, y-2008-ec2-golden, b-2008-ec2-golden, g-w732-ec2-golden, t-w732-ec2-golden, tst-emulator64-ec2-golden. Maybe from earlier interventions, or crond has started multiple jobs. A papertrail search indicates the latter ('CROND tst-emulator64-ec2-golden' turns up two for the 6th and 7th, with the same timestamp; only one earlier). We've seen cron get confused before, so I've restarted crond on aws-manager2.
Then there are alerts like:

01:38 <nagios-releng> Sun 06:38:43 PDT [4023] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 10 crit, 0 warn out of 20 processes with args ec2-golden
03:38 <nagios-releng> Sun 08:38:42 PDT [4027] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 20 crit, 0 warn out of 20 processes with args ec2-golden

It's actually 4 golden AMI generations that have gone wrong - g-w732, t-w732, b-2008, y-2008. There appear to be a couple of errors in the puppet mail:

1. chocolatey might have blacklisted us

2017-05-07T23:33:38.703Z: Ec2HandleUserData: Message: The errors from user scripts:
Exception calling "DownloadString" with "1" argument(s): "The remote server returned an error: (403) Forbidden."
At C:\windows\system32\WindowsPowerShell\v1.0\Modules\Ec2UserdataUtils\Ec2UserdataUtils.psm1:1746 char:70
+ Invoke-Expression ((New-Object System.Net.WebClient).DownloadString <<<< ('https://chocolatey.org/install.ps1'))
    + CategoryInfo          : NotSpecified: (:) [], MethodInvocationException
    + FullyQualifiedErrorId : DotNetMethodException

And if I do this from buildbot-master82:

$ wget https://chocolatey.org/install.ps1
--2017-05-07 17:08:15-- https://chocolatey.org/install.ps1
Resolving chocolatey.org... 104.20.73.28, 104.20.74.28, 2400:cb00:2048:1::6814:4a1c, ...
Connecting to chocolatey.org|104.20.73.28|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2017-05-07 17:08:15 ERROR 403: Forbidden.

Works fine locally, so I suspect they've blocked the SCL3 NAT (63.245.214.82). I found this tweet: https://twitter.com/ferventcoder/status/860245736973316098

"If you just started seeing 403 errors with the #chocolatey community repository today, please reach out to us at https://gitter.im/chocolatey/choco"

Does that match up? I've commented on that gitter.

2. sublimetext3 isn't installing

Userdata: Install-Package :: sublimetext3.packagecontrol install failed with exit code: 1

The last time we successfully built windows AMI are 2015-05-04.
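(For completeness, the equivalent test from PowerShell on one of the Windows instances would look roughly like the snippet below; this is illustrative only, not part of the userdata code.)

try {
    (New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1') | Out-Null
    'download ok'
} catch {
    # expect "The remote server returned an error: (403) Forbidden." while the NAT IP is blocked
    $_.Exception.Message
}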
I mentioned in irc that I killed off the instances (to stop the puppet email) but left the processes running on aws-manager2 (to stop them from kicking off again tonight) to give markco a chance to look at them tomorrow.
(In reply to Nick Thomas [:nthomas] from comment #8)
> The last time we successfully built windows AMI are 2015-05-04.

Correction - 2017-05-04.

I spoke with Rob Reynolds of Chocolatey (via Gitter chat) and he confirmed that they've started blocking our requests. Their server stats show about 4 million package installs from our IP over the last 30 days, the vast majority of them of chocolatey itself. His suggestions are to
* check for this chef bug - https://github.com/chocolatey/chocolatey-cookbook/issues/105
* install chocolatey from an internal location

This can only be our buildbot infra given it's from the single NAT IP, which I think means t/g-w732 and b/y-2008 (not used on the ix machines). It's almost like we re-install every time we start a spot instance, although that would still be an average of 130k instance starts a day, which seems high.

AIUI the install happens in Install-BasePrerequisites
https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/Ec2UserdataUtils.psm1#L1740
called by b-2008.user-data (which is symlinked to the other worker pools mentioned above):
https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/b-2008.user-data#L110

Perhaps we just move Install-BasePrerequisites into Prep-Golden. Over to Mark now for proper debugging. Rob R. said to let them know when they can drop the block.
(In reply to Nick Thomas [:nthomas] from comment #7)
> Today we have nagios alerts like:
> <relengbot> [sns alert] May 07 01:13:01 aws-manager2.srv.releng.scl3.mozilla.com try-linux64-ec2-golden: lockfile: Sorry, giving up on "/builds/aws_manager/try-linux64-ec2-golden.lock"
>
> Also for bld-linux64-ec2-golden, av-linux64-ec2-golden, tst-linux64-ec2-golden, tst-linux32-ec2-golden, y-2008-ec2-golden, b-2008-ec2-golden, g-w732-ec2-golden, t-w732-ec2-golden, tst-emulator64-ec2-golden. Maybe from earlier interventions, or crond has started multiple jobs. A papertrail search indicates the latter ('CROND tst-emulator64-ec2-golden' turns up two for the 6th and 7th, with the same timestamp; only one earlier). We've seen cron get confused before so I've restarted crond on aws-manager2.

The same alerts were spotted again today in #buildduty. There were two crond instances running on aws-manager:

root      5852     1  0 Feb23 ?        00:01:40 crond
root      6325     1  0 May07 ?        00:00:00 crond

For some reason, running "/etc/init.d/crond restart" will not stop the older process, so it will result in two different processes. I killed both of them, then started crond again. Now we only have one process running.
Assigning this to rob to fix the chocolatey issues. I'm not sure that's the underlying problem, but we need to fix that, regardless.
Assignee: mcornmesser → rthijssen
install chocolatey from internal infra: https://github.com/mozilla-releng/build-cloud-tools/commit/940ca418fd9c33f6ed91837576f5f47ac8c4cb3b still need to alter the script at http://releng-puppet2.srv.releng.scl3.mozilla.com/repos/EXEs/chocolatey/install.ps1 to make sure it also downloads any artifacts it needs from infra.
install chocolatey from internal infra: https://github.com/mozilla-releng/build-cloud-tools/commit/fcaa07a06044ed3ad40601bf86b78207efb0650e this should resolve the high traffic to chocolatey.org
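roughly, this points the existing install step at our own mirror instead of chocolatey.org. a hedged sketch of the idea (the exact change is in the commits above; the url is the internal one mentioned in the previous comment):

# download the chocolatey installer from the internal mirror rather than chocolatey.org
$internalInstaller = 'http://releng-puppet2.srv.releng.scl3.mozilla.com/repos/EXEs/chocolatey/install.ps1'
Invoke-Expression ((New-Object System.Net.WebClient).DownloadString($internalInstaller))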
Landed https://github.com/mozilla-releng/build-cloud-tools/commit/b142bfb0582ed1887a37e5e42d14ec7e1c34331b to try to resolve the tree closure in bug 1363204 (only run Install-BasePrerequisites on the golden ami).
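The gist of that change is to guard the expensive install so only the golden instance runs it. A hypothetical sketch of such a guard (the hostname check here is made up for illustration; the real detection lives in the userdata/psm1 code):

Import-Module Ec2UserdataUtils              # assumed to provide Install-BasePrerequisites
$hostname = [System.Net.Dns]::GetHostName()
if ($hostname -match 'ec2-golden') {
    Install-BasePrerequisites               # only the golden AMI builder installs prerequisites
} else {
    'spot instance: prerequisites are already baked into the AMI, skipping install'
}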
That worked for new instances, but there might still be a netflow issue for the golden AMIs.
As far as I can tell there have been no changes to the github cloud-tools repo or to Puppet that would have affected Windows golden AMI creation. grenade: I am kind of at a loss here. After hardcoding the URL for chocolatey and commenting out the reboot after being renamed, I am able to run the userdata script, reboot manually, rerun the script, and it will work through to puppetizing. On the second run the computername variable is also correct. This seems to only happen when the EC2 service is executing the scripts. Any thoughts? Any ideas on where else to look?
Flags: needinfo?(rthijssen)
presumably the new problem relates to my use of the $domain variable in the url to the puppet server. it's probably worth trying a url of: http://puppet/repos/EXEs/chocolatey/... instead of attempting to build an fqdn as i did in my patches. the hostname "puppet" is *supposed to* resolve to the nearest puppet server (courtesy of our dns settings), so as long as that still works, it may fix the issue. there's probably something screwy in my attempts to build an fqdn to the puppet server for the correct region.
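in other words, something like this (illustrative only, assuming the dns search suffixes still resolve the shortname):

# check that the 'puppet' shortname resolves from the instance
[System.Net.Dns]::GetHostEntry('puppet').HostName                                 # should print the nearest puppet server's fqdn

# then the installer url no longer needs an fqdn built from $domain
$chocoInstaller = 'http://puppet/repos/EXEs/chocolatey/install.ps1'
(New-Object System.Net.WebClient).DownloadString($chocoInstaller) | Out-Null      # throws if unreachable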
Flags: needinfo?(rthijssen)
We should be good after this lands. In testing I was able to spin up and capture a y-2008 golden AMI. The instances it spawned successfully finished builds.
Attachment #8866583 - Flags: review?(rthijssen)
I let upstream know we fixed the issue with installing chocolatey on every boot.
We also noticed that there's a tooltool download failing, and the tokens that should be on disk are missing on those instances that are failing the download. This + the ssh issues feel like some instances are instantiating as loaners and having their secrets wiped: https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/Ec2UserdataUtils.psm1#L775 I think the hostname is now matching the loaner string because it's no longer correctly reporting as golden: https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/b-2008.user-data#L126
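To illustrate the suspicion (both patterns below are made up for the example, not the actual ones in Ec2UserdataUtils.psm1): a loose loaner regex can match an ordinary spot/golden hostname and trigger the secret purge, while an anchored one would not.

$hostname = 'b-2008-ec2-golden'
$hostname -match '.*-ec2-.*'           # True  -> loaner path taken, secrets wiped
$hostname -match '\.loan\.releng\.'    # False -> instance left intact (assumes loaner hostnames carry a .loan. suffix)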
None of the ec2data reports sent to the puppet alias contain the log lines that would indicate that secrets were purged. Those lines haven't been seen in a report since loaners were set up in January.
Comment on attachment 8867912 [details] [review] Replace portion of the Install-RelOpsPrerequisites function I think this one will be superseded.
Attachment #8867912 - Flags: review?(nthomas)
For testing/debugging purposes, I've created a different AWS config and am using /builds/aws_manager/bin/aws_manager-b-2008-arr-ec2-golden.sh to generate an AMI. It uses /builds/aws_manager/cloud-tools/configs/b-2008-arr and /builds/aws_manager/cloud-tools/configs/b-2008-arr.user-data which reads its psm1 config from my fork of the build-cloud-tools repo instead of the live production repo (so I can make changes without affecting prod). Once testing is done, those three files should be cleaned up.
I've reverted b9c1aed49990b4a7e7ad28a90e030219d2634f5f to see if that helps until markco is awake.
Blocks: 1365219
I'm going to temporarily turn on mail for EVERY run, including spot. This is going to generate a flood of email, but it's our best hope of seeing what's happening.
Attachment #8868115 - Flags: review?(aselagea)
Comment on attachment 8868115 [details] [diff] [review] enable-spot-userdata-email.patch Looks good!
Attachment #8868115 - Flags: review?(aselagea) → review+
We launched a lot of instances and got data from them, so I've reverted comment 30.
I've retagged the t-w732, g-w732, b-2008, and y-2008 AMIs created after May 3rd as <host>-1362356. This should roll back the AMIs for new instances created today. I don't think that's going to help, since I think the issue is with UserData, but it returns us to a time when we weren't failing to generate the golden AMIs and before we made changes to accommodate that.
Correction, the move to c4.4xlarge was merged on the 2nd, so I rolled back to the AMIs for the 2nd.
Attached patch bug1362356.patch (obsolete) — Splinter Review
Attachment #8868230 - Flags: review?(arich)
Attachment #8868230 - Flags: review?(arich)
Attached patch bug1362356-2.patch (obsolete) — Splinter Review
Attachment #8868230 - Attachment is obsolete: true
Attachment #8868233 - Flags: review?(arich)
Attachment #8868233 - Flags: review?(arich)
Attachment #8868233 - Attachment is obsolete: true
Attachment #8868277 - Flags: review?(arich)
Attachment #8868277 - Flags: review?(arich) → review+
Attached patch e2c-debacle.diff — Splinter Review
There was a lot of debugging today, and Q, markco, and I found and corrected a lot of logic errors and issues. Instead of detailing them all with individual patches, I'm going to include this one large patch that shows the diff between when things were last working and now.

We think there were a number of contributing factors to the issues we've seen:

* First, we think something changed with the way that AWS is reporting env variables, because a bunch of things broke that are based on env vars.
* There was a broken regex for matching loaner instances. We think that might have also been hitting b-2008 because of the hostname/env var issues. This would have partially run the loaner function and deleted ssh keys.
* There was a race condition with mounting the ephemeral disk and copying things back to it. The host would sometimes reboot in the middle of that process, so we were missing buildapi tokens (and other files). We also put in a semaphore (sketched below) to make sure that runner would not start until the files were all copied over, just in case we were hitting another race condition where runner would pick a job without the token.
* We also found that puppet was running on every reboot. That's been disabled except on the golden instance.
* We were installing spurious packages like chocolatey (which we had added when we thought the future was going to be buildbot+AWS+puppet). We wound up DDOSing the chocolatey folks because that was ALSO running and installing chocolatey from their infrastructure on every reboot.
* Rob's original fix for this ripped out a bit too much and we lost some functions we actually needed (nxlog config, windows error reporting, stopping puppet, etc). Those were added back in.
* Install-MozillaBuildAndPrerequisites was removed because all of this is handled by puppet. Same with Enable-CloneBundle.

In addition to this, we still need to fix the golden AMI generation. It's failing due to the regex match failure we originally saw when it tried to generate the AMI on the 5th. markco is working on a patch for that now.
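For the semaphore mentioned above, the idea is roughly the following; paths and the runner entry point are placeholders, not the values from the actual patch:

# userdata side: drop a flag file only after the ephemeral-disk copy has finished
$semaphore = 'C:\builds\userdata-complete.sem'             # placeholder path
# ... mount the ephemeral disk and copy buildapi tokens and other secrets here ...
Set-Content -Path $semaphore -Value (Get-Date -Format o)

# runner side: refuse to start picking jobs until the flag file exists
while (-not (Test-Path $semaphore)) {
    Start-Sleep -Seconds 10
}
Start-Process 'C:\opt\runner\runner.cmd'                   # placeholder for the real runner start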
One of the other potential issues was the switch from c3.4xlarge to c4.4xlarge which got reverted. At the very least, it slowed builds down considerably and may have exacerbated some of the timeouts and race conditions we saw.
This has been running for a couple days and the issues seem to be cleared up.
Assignee: rthijssen → mcornmesser
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED