Closed
Bug 1362356
Opened 8 years ago
Closed 8 years ago
Bad computer name env. variable for Windows AMIs
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: aselagea, Assigned: markco)
References
Details
Attachments
(5 files, 2 obsolete files)
- 64 bytes, text/x-github-pull-request (grenade: review+)
- 60 bytes, text/x-github-pull-request
- 1.44 KB, patch (aselagea: review+)
- 495 bytes, patch (arich: review+)
- 13.09 KB, patch
Noticed some issues with the AMI generation process for b/y-2008 and t/g-w732 today, caused by an incorrectly set computer name environment variable. The instances changed their hostname and rebooted, but since that didn't actually fix the mismatch, they entered an infinite loop of reboots that overloaded the puppet masters.
e.g.
2017-05-05 02:57:46 -07:00 [INFO] Primary DNS suffix set to: test.releng.use1.mozilla.com
2017-05-05 02:57:46 -07:00 [DEBUG] net dns hostname: g-w732-ec2-golden, expected: g-w732-ec2-golden
2017-05-05 02:57:46 -07:00 [DEBUG] computer name env var: G-W732-EC2-GOLD, expected: g-w732-ec2-golden
2017-05-05 02:57:47 -07:00 [INFO] hostname set to: g-w732-ec2-golden
2017-05-05 02:57:47 -07:00 [INFO] shutting down with reason: host renamed
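The truncated value in that log ("G-W732-EC2-GOLD", exactly 15 characters) is the classic symptom of Windows limiting the NetBIOS computer name, and therefore %COMPUTERNAME%, to 15 characters. A minimal Python sketch (function names are hypothetical; the real check lives in the PowerShell userdata) of why a strict comparison loops forever:

```python
# Windows truncates the NetBIOS computer name (and hence %COMPUTERNAME%)
# to 15 characters, so a strict comparison against "g-w732-ec2-golden"
# (17 chars) can never succeed: the host renames itself, reboots, checks
# again, and repeats.

NETBIOS_MAX = 15

def computer_name_env(hostname: str) -> str:
    """Approximate what Windows exposes as %COMPUTERNAME%."""
    return hostname.upper()[:NETBIOS_MAX]

def needs_rename(expected: str) -> bool:
    # Naive check: compares the truncated env var to the full hostname.
    return computer_name_env(expected) != expected

print(computer_name_env("g-w732-ec2-golden"))  # G-W732-EC2-GOLD
print(needs_rename("g-w732-ec2-golden"))       # True -> rename + reboot, forever
```

Any rename-and-reboot remediation keyed on that strict comparison can never converge for hostnames longer than 15 characters.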
Reporter
Comment 1•8 years ago
I terminated the instances and re-ran the scripts for those AMIs, but got into the same issues again.
@markco: did something change in the AMI generation process recently?
Flags: needinfo?(mcornmesser)
Comment 2•8 years ago
if there are new amis with a creation date right before the problem started happening, i would just delete the new amis (and any created since) in both regions. this would force use of last known good ami.
Reporter
Comment 3•8 years ago
No, we don't have new AMIs for Windows today. So we're using the AMIs from yesterday (which are fine).
Assignee
Comment 4•8 years ago
(In reply to Alin Selagea [:aselagea][:buildduty] from comment #1)
> I terminated the instances and re-ran the scripts for those AMIs, but got
> into the same issues again.
>
> @markco: did something change in the AMI generation process recently?
Nothing that I am aware of. I will take a look into it today.
Flags: needinfo?(mcornmesser)
Comment 5•8 years ago
I know mark was taking a look at this yesterday, but it appears that it hasn't been solved. There were a ton of puppet attempts when the cron jobs kicked off last night. I've killed off those processes on aws-manager2. For the time being, I'm just going to leave the golden instance up so it doesn't try to do the same thing tomorrow (it'll just fail saying that the IP is already allocated).
Assignee: nobody → mcornmesser
Component: Buildduty → RelOps
Product: Release Engineering → Infrastructure & Operations
QA Contact: bugspam.Callek → arich
Comment 6•8 years ago
I take it back, they were still trying to run puppet, so I terminated the instances to stop the puppet spam.
Comment 7•8 years ago
Today we have nagios alerts like:
<relengbot> [sns alert] May 07 01:13:01 aws-manager2.srv.releng.scl3.mozilla.com try-linux64-ec2-golden: lockfile: Sorry, giving up on "/builds/aws_manager/try-linux64-ec2-golden.lock"
Also for bld-linux64-ec2-golden, av-linux64-ec2-golden, tst-linux64-ec2-golden, tst-linux32-ec2-golden, y-2008-ec2-golden, b-2008-ec2-golden, g-w732-ec2-golden, t-w732-ec2-golden, tst-emulator64-ec2-golden. Maybe from earlier interventions, or crond has started multiple jobs. A papertrail search indicates the latter ('CROND tst-emulator64-ec2-golden' turns up two for the 6th and 7th, with the same timestamp; only one earlier). We've seen cron get confused before so I've restarted crond on aws-manager2.
Comment 8•8 years ago
Then there are alerts like:
01:38 <nagios-releng> Sun 06:38:43 PDT [4023] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 10 crit, 0 warn out of 20 processes with args ec2-golden
03:38 <nagios-releng> Sun 08:38:42 PDT [4027] [] aws-manager2.srv.releng.scl3.mozilla.com:procs age - golden AMI is CRITICAL: ELAPSED CRITICAL: 20 crit, 0 warn out of 20 processes with args ec2-golden
It's actually 4 golden AMI generations that have gone wrong - g-w732, t-w732, b-2008, y-2008
There appear to be a couple of errors in the puppet mail:
1. chocolatey might have blacklisted us
2017-05-07T23:33:38.703Z: Ec2HandleUserData: Message: The errors from user scripts: Exception calling "DownloadString" with "1" argument(s): "The remote server returned an error: (403) Forbidden."
At C:\windows\system32\WindowsPowerShell\v1.0\Modules\Ec2UserdataUtils\Ec2UserdataUtils.psm1:1746 char:70
+ Invoke-Expression ((New-Object System.Net.WebClient).DownloadString <<<< ('https://chocolatey.org/install.ps1'))
+ CategoryInfo : NotSpecified: (:) [], MethodInvocationException
+ FullyQualifiedErrorId : DotNetMethodException
And if I do this from buildbot-master82:
$ wget https://chocolatey.org/install.ps1
--2017-05-07 17:08:15-- https://chocolatey.org/install.ps1
Resolving chocolatey.org... 104.20.73.28, 104.20.74.28, 2400:cb00:2048:1::6814:4a1c, ...
Connecting to chocolatey.org|104.20.73.28|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2017-05-07 17:08:15 ERROR 403: Forbidden.
Works fine locally, so I suspect they've blocked the SCL3 NAT (63.245.214.82). I found this tweet: https://twitter.com/ferventcoder/status/860245736973316098
If you just started seeing 403 errors with the #chocolatey community repository today, please reach out to us at https://gitter.im/chocolatey/choco
Does that match up? I've commented on that Gitter channel.
2. sublimetext3 isn't installing
Userdata: Install-Package :: sublimetext3.packagecontrol install failed with exit code: 1#015
The last time we successfully built windows AMI are 2015-05-04.
Comment 9•8 years ago
I mentioned in IRC that I killed off the instances (to stop the puppet email) but left the processes running on aws-manager2 (to stop them from kicking off again tonight) to give markco a chance to look at them tomorrow.
Comment 10•8 years ago
(In reply to Nick Thomas [:nthomas] from comment #8)
> The last time we successfully built windows AMI are 2015-05-04.
Correction - 2017-05-04.
I spoke with Rob Reynolds of Chocolatey (via Gitter chat) and he confirmed that they've started blocking our requests. Their server stats show about 4 million package installs from our IP over the last 30 days, the vast majority of them installs of chocolatey itself. His suggestions are to
* check for this chef bug - https://github.com/chocolatey/chocolatey-cookbook/issues/105
* install chocolatey from an internal location
This can only be our buildbot infra given it's from the single NAT IP, which I think means t/g-w732 and b/y-2008 (not used on the ix machines). It's almost like we re-install every time we start a spot instance, although that would still be an average of 130k instance starts a day, which seems high.
AIUI the install happens in Install-BasePrerequisites
https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/Ec2UserdataUtils.psm1#L1740
called by b-2008.user-data (which is symlinked to the other worker pools mentioned above):
https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/b-2008.user-data#L110
Perhaps we just move Install-BasePrerequisites into Prep-Golden. Over to Mark now for proper debugging. Rob R. said to let them know when they can drop the block.
Reporter
Comment 11•8 years ago
(In reply to Nick Thomas [:nthomas] from comment #7)
> Today we have nagios alerts like:
> <relengbot> [sns alert] May 07 01:13:01
> aws-manager2.srv.releng.scl3.mozilla.com try-linux64-ec2-golden: lockfile:
> Sorry, giving up on "/builds/aws_manager/try-linux64-ec2-golden.lock"
>
> Also for bld-linux64-ec2-golden, av-linux64-ec2-golden,
> tst-linux64-ec2-golden, tst-linux32-ec2-golden, y-2008-ec2-golden,
> b-2008-ec2-golden, g-w732-ec2-golden, t-w732-ec2-golden,
> tst-emulator64-ec2-golden. Maybe from earlier interventions, or crond has
> started multiple jobs. A papertrail search indicates the latter ('CROND
> tst-emulator64-ec2-golden' turns up two for the 6th and 7th, with the same
> timestamp; only one earlier). We've seen cron get confused before so I've
> restarted crond on aws-manager2.
The same alerts were spotted again today in #buildduty.
There were two crond instances running on aws-manager.
root 5852 1 0 Feb23 ? 00:01:40 crond
root 6325 1 0 May07 ? 00:00:00 crond
For some reason, running "/etc/init.d/crond restart" did not stop the older process, resulting in two different processes. I killed both of them, then started crond again. Now we only have one process running.
Comment 12•8 years ago
Assigning this to rob to fix the chocolatey issues. I'm not sure that's the underlying problem, but we need to fix that, regardless.
Assignee: mcornmesser → rthijssen
Comment 13•8 years ago
install chocolatey from internal infra:
https://github.com/mozilla-releng/build-cloud-tools/commit/940ca418fd9c33f6ed91837576f5f47ac8c4cb3b
still need to alter the script at http://releng-puppet2.srv.releng.scl3.mozilla.com/repos/EXEs/chocolatey/install.ps1 to make sure it also downloads any artifacts it needs from infra.
Comment 14•8 years ago
install chocolatey from internal infra:
https://github.com/mozilla-releng/build-cloud-tools/commit/fcaa07a06044ed3ad40601bf86b78207efb0650e
this should resolve the high traffic to chocolatey.org
Comment 15•8 years ago
Comment 16•8 years ago
Landed https://github.com/mozilla-releng/build-cloud-tools/commit/b142bfb0582ed1887a37e5e42d14ec7e1c34331b to try to resolve the tree closure in bug 1363204 (only run Install-BasePrerequisites on the golden ami).
Comment 17•8 years ago
That worked for new instances, but there might still be a netflow issue for the golden AMIs.
Assignee
Comment 18•8 years ago
As far as I can tell there have been no changes to the github cloud-tools repo or Puppet that would have affected Windows golden AMI creation.
grenade: I am kind of at a loss here. After hardcoding the URL for chocolatey and commenting out the reboot after being renamed, I am able to run the userdata script, manually reboot, rerun the script, and it will work through to puppetizing. Also, on the second run the computername variable is correct. This only seems to happen when the EC2 service is executing the scripts. Any thoughts? Any ideas on where else to look?
Flags: needinfo?(rthijssen)
Comment 19•8 years ago
presumably the new problem relates to my use of the $domain variable in the url to the puppet server. it's probably worth trying a url of: http://puppet/repos/EXEs/chocolatey/... instead of attempting to build an fqdn as i did in my patches. the hostname "puppet" is *supposed to* resolve to the nearest puppet server (courtesy of our dns settings), so as long as that still works, it may fix the issue. there's probably something screwy in my attempts to build an fqdn to the puppet server for the correct region.
Flags: needinfo?(rthijssen)
Comment 20•8 years ago
ignore discrepancies between "-GOLD" & "-GOLDEN" in computer name check
https://github.com/mozilla-releng/build-cloud-tools/commit/b9c1aed49990b4a7e7ad28a90e030219d2634f5f
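In Python terms, the tolerance that commit adds might look like the following sketch (hypothetical helper, not the actual PowerShell change): accept the env var when it is a case-insensitive prefix of the expected hostname, truncated at the NetBIOS limit.

```python
NETBIOS_MAX = 15  # Windows truncates the NetBIOS computer name to 15 chars

def names_match(env_name: str, expected: str) -> bool:
    # Accept "G-W732-EC2-GOLD" for "g-w732-ec2-golden": compare
    # case-insensitively and tolerate truncation at the NetBIOS limit.
    env = env_name.lower()
    exp = expected.lower()
    return exp.startswith(env) and len(env) >= min(len(exp), NETBIOS_MAX)

assert names_match("G-W732-EC2-GOLD", "g-w732-ec2-golden")      # truncated: OK
assert not names_match("G-W732-EC2-GOLD", "b-2008-ec2-golden")  # different host
```

With a check like this, the truncated env var no longer triggers the rename-and-reboot path that caused the loop.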
Assignee
Comment 21•8 years ago
We should be good after this lands. In testing I was able to spin up and capture a y-2008 golden AMI. The instances it spawned successfully finished builds.
Attachment #8866583 -
Flags: review?(rthijssen)
Comment 22•8 years ago
I let upstream know we fixed the issue with installing chocolatey on every boot.
Comment 23•8 years ago
Comment on attachment 8866583 [details] [review]
Remove Chocolatly and functions dependent on it
merged: https://github.com/mozilla-releng/build-cloud-tools/commit/f029f17a73e0f3a6fa2d00fb54cc2b5a95dfc504
Attachment #8866583 -
Flags: review?(rthijssen) → review+
Assignee
Comment 24•8 years ago
Attachment #8867912 -
Flags: review?(nthomas)
Comment 25•8 years ago
We also noticed that there's a tooltool download failing, and the tokens that should be on disk are missing on those instances that are failing the download. This + the ssh issues feel like some instances are instantiating as loaners and having their secrets wiped:
https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/Ec2UserdataUtils.psm1#L775
I think the hostname is now matching the loaner string because it's no longer correctly reporting as golden:
https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/b-2008.user-data#L126
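As a hypothetical illustration (the real pattern is in Ec2UserdataUtils.psm1 and is not reproduced here): if the loaner check keys off what a hostname is not, NetBIOS truncation of the env var can flip the classification.

```python
import re

# Hypothetical loaner pattern -- NOT the real one from Ec2UserdataUtils.psm1.
# If the pattern classifies anything that doesn't end in a known worker
# suffix as a loaner, a truncated env var like "B-2008-EC2-GOLD" no longer
# ends in "-golden" and suddenly matches, triggering the loaner setup path
# that wipes secrets (ssh keys, tooltool/buildapi tokens).
LOANER_RE = re.compile(r"^(?!.*-(golden|spot)$).*$", re.IGNORECASE)

def looks_like_loaner(name: str) -> bool:
    return bool(LOANER_RE.match(name))

print(looks_like_loaner("b-2008-ec2-golden"))  # False: recognized worker
print(looks_like_loaner("B-2008-EC2-GOLD"))    # True: truncation breaks the suffix check
```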
Comment 26•8 years ago
None of the log messages found in the ec2data reports sent to the puppet alias show the log lines that would indicate that secrets were purged. Those lines haven't been seen in a report since loaners set up in January.
Comment 27•8 years ago
Comment on attachment 8867912 [details] [review]
Replace portion of the Install-RelOpsPrerequisites function
I think this one will be superseded.
Attachment #8867912 -
Flags: review?(nthomas)
Comment 28•8 years ago
For testing/debugging purposes, I've created a different AWS config and am using /builds/aws_manager/bin/aws_manager-b-2008-arr-ec2-golden.sh to generate an AMI. It uses /builds/aws_manager/cloud-tools/configs/b-2008-arr and
/builds/aws_manager/cloud-tools/configs/b-2008-arr.user-data which reads its psm1 config from my fork of the
build-cloud-tools repo instead of the live production repo (so I can make changes without affecting prod).
Once testing is done, those three files should be cleaned up.
Comment 29•8 years ago
I've reverted b9c1aed49990b4a7e7ad28a90e030219d2634f5f to see if that helps until markco is awake.
Blocks: 1365219
Comment 30•8 years ago
I'm going to temporarily turn on mail for EVERY run, including spot. This is going to generate a flood of email, but it's our best hope of seeing what's happening.
Attachment #8868115 -
Flags: review?(aselagea)
Reporter
Comment 31•8 years ago
Comment on attachment 8868115 [details] [diff] [review]
enable-spot-userdata-email.patch
Looks good!
Attachment #8868115 -
Flags: review?(aselagea) → review+
Comment 32•8 years ago
We launched a lot of instances and got data from them, so I reverted the change from comment 30.
Comment 33•8 years ago
I've retagged the t-w732 g-w732 b-2008 and y-2008 AMIs created after may 3rd as <host>-1362356. This should roll back the AMIs for new instances created today. I don't think that's going to help, since I think the issue is with UserData, but it returns us to a time when we weren't failing generating the golden and before we made changes to accommodate that.
Comment 34•8 years ago
Correction, the move to c4.4xlarge was merged on the 2nd, so I rolled back to the AMIs for the 2nd.
Assignee
Comment 35•8 years ago
Attachment #8868230 -
Flags: review?(arich)
Assignee
Updated•8 years ago
Attachment #8868230 -
Flags: review?(arich)
Assignee
Comment 36•8 years ago
Attachment #8868230 -
Attachment is obsolete: true
Attachment #8868233 -
Flags: review?(arich)
Assignee
Updated•8 years ago
Attachment #8868233 -
Flags: review?(arich)
Assignee
Comment 37•8 years ago
Attachment #8868233 -
Attachment is obsolete: true
Attachment #8868277 -
Flags: review?(arich)
Updated•8 years ago
Attachment #8868277 -
Flags: review?(arich) → review+
Comment 38•8 years ago
There was a lot of debugging today, and Q, markco, and I found and corrected a lot of logic errors and issues. Instead of detailing them all with individual patches, I'm going to include this one large patch that shows the diff between when things were last working and now.
We think there were a number of contributing factors to the issues we've seen. First, we think something changed with the way that AWS is reporting env variables, because a bunch of things broke that are based on env vars.
There was a broken regex for matching loaner instances. We think that might have also been hitting b-2008 because of the hostname/env var issues. This would have partially run the loaner function and deleted ssh keys.
There was a race condition with mounting the ephemeral disk and copying things back to it. The host would sometimes reboot in the middle of that process, so we were missing buildapi tokens (and other files). We also put in a semaphore to make sure that runner would not start until the files were all copied over, just in case we were hitting another race condition where runner would pick a job without the token.
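The copy-then-signal semaphore described above can be sketched as follows (Python, with a hypothetical flag path; the production version is in the PowerShell userdata):

```python
import os
import time

# Hypothetical flag path -- the real semaphore lives in the Windows userdata.
READY_FLAG = r"C:\builds\copy_complete.semaphore"

def copy_secrets_then_signal(copy_fn):
    """Producer side: copy tokens/files to the ephemeral disk, then drop a flag."""
    copy_fn()
    with open(READY_FLAG, "w") as f:
        f.write("ok")

def wait_for_secrets(timeout=600, poll=5):
    """Runner side: refuse to pick up a job until the flag file exists."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(READY_FLAG):
            return True
        time.sleep(poll)
    return False
```

Gating runner on the flag means that even if the host reboots mid-copy, a job can't start with a missing buildapi token; the worst case is a runner timeout instead of a broken build.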
We also found that puppet was running on every reboot. That's been disabled except on the golden instance.
We were installing spurious packages like chocolatey (which we had added when we thought the future was going to be buildbot+AWS+puppet). We wound up DDoSing the chocolatey folks because that was ALSO running and installing chocolatey from their infrastructure on every reboot. Rob's original fix for this ripped out a bit too much and we lost some functions we actually needed (nxlog config, windows error reporting, stopping puppet, etc). Those were added back in.
Install-MozillaBuildAndPrerequisites was removed because all of this is handled by puppet. Same with Enable-CloneBundle.
In addition to this, we still need to fix the golden AMI generation. It's failing due to the regex match failure we originally saw when it tried to generate the AMI on the 5th. markco is working on a patch for that now.
Comment 39•8 years ago
One of the other potential issues was the switch from c3.4xlarge to c4.4xlarge which got reverted. At the very least, it slowed builds down considerably and may have exacerbated some of the timeouts and race conditions we saw.
Comment 40•8 years ago
This has been running for a couple days and the issues seem to be cleared up.
Assignee: rthijssen → mcornmesser
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED