Closed Bug 1440062 Opened 6 years ago Closed 6 years ago

[MDC2] kickstart releng-puppet1/2 in MDC2

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dividehex, Assigned: dividehex)

Attachments

(3 files, 1 obsolete file)

Before we can kickstart the rest of the VMs in mdc2, we need the puppet environment up and running.
I'm getting a failure downloading the kickstart cfg file in mdc2. We had the same problem in mdc1, where the cause was a missing network flow for HTTP to admin1a from [srv,test].releng.mdc1.
Assignee: relops → dhouse
I tested reaching over to the mdc1 admin server for the kickstart, which also failed; that is what my previous screenshot showed. I'm updating the attachment with the correct mdc2 admin connection failure.
Attachment #8952848 - Attachment is obsolete: true
Changing the kickstart cfg url in the grub works.
http://admin1.vips.private.mdc2.mozilla.com/kickstart/profiles/pa-c65-64-vmware.cfg
instead of:
http://10.50.75.31/kickstart/profiles/pa-c65-64-vmware.cfg
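As a rough sketch of that edit, the host in the `ks=` argument can be swapped with sed. The `rewrite_ks_host` helper and the surrounding kernel command line (`vmlinuz ... ksdevice=bootif`) are made up for illustration; only the two URLs come from this bug.

```shell
#!/bin/sh
# Hypothetical helper: replace the host part of the ks= URL on a kernel
# command line. $1 = kernel command line, $2 = new host name or IP.
rewrite_ks_host() {
    printf '%s\n' "$1" | sed -E "s#(ks=https?://)[^/ ]+#\1$2#"
}

# Illustrative command line; only the ks= URL is taken from this bug.
old='vmlinuz ks=http://10.50.75.31/kickstart/profiles/pa-c65-64-vmware.cfg ksdevice=bootif'
rewrite_ks_host "$old" admin1.vips.private.mdc2.mozilla.com
```

The same substitution can of course be typed by hand at the grub prompt, which is what was done here.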
After that, the CentOS install needs to download the 6.5 image, and it has nowhere to get it from (it cannot reach the scl3 or mdc1 puppet masters on port 80/HTTP). So I checked the official CentOS 6.5 mirror and I'm using that for releng-puppet2, although the mirror's image is not exactly the same as ours. So I'm thinking of imaging puppet1 first from there, and then re-imaging puppet1 from puppet2 once I have puppet2 built from the correct image on puppet1.
http://mirrors.kernel.org/centos/6/os/x86_64/images/install.img
No need to diff, they are a different size:
```
-rw-r--r-- 1 puppetsync puppetsync 144060416 Nov 29  2013 /data/repos/yum/mirrors/centos/6.5/os/x86_64/images/install.img
-rw-rw-r-- 1 dhouse     dhouse     146558976 Mar 28  2017 ./install.img
```
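A size check like the one above can be scripted as a cheap first test before any checksum or diff. This is only a sketch: `size_of` and `same_size` are hypothetical helper names, and the demo is skipped unless both files exist.

```shell
#!/bin/sh
# size_of FILE: print the file's size in bytes.
size_of() { wc -c < "$1" | tr -d ' '; }

# same_size A B: succeed only when both files have the same byte count.
same_size() { [ "$(size_of "$1")" -eq "$(size_of "$2")" ]; }

# Paths as listed in this bug; nothing runs if either file is missing.
local_img=/data/repos/yum/mirrors/centos/6.5/os/x86_64/images/install.img
mirror_img=./install.img
if [ -f "$local_img" ] && [ -f "$mirror_img" ]; then
    if same_size "$local_img" "$mirror_img"; then
        echo "sizes match; compare checksums next"
    else
        echo "sizes differ; the images are not identical"
    fi
fi
```

A size mismatch is enough to prove the images differ; matching sizes would still need a checksum comparison.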
I requested network flows to allow HTTP, HTTPS, and the puppet ports to the puppet masters from all mdc2 releng VLANs (srv, relabs, test, wintest).
Some flows were fixed in bug 1440157.
I am kickstarting releng-puppet1 again. I changed the kickstart.cfg hostname to use admin1.vips, and created a "repos" CNAME pointing at releng-puppet1.srv.releng.mdc1 (the HTTP connection to the scl3 puppet masters still fails).
I created the CNAME "puppet" so that the puppetize script can reach over to the mdc1 puppet master.
Also, I manually pulled certs.sh because the script's own download of it fails (perhaps python urllib2 is failing on SSL).
```
curl --user user:pass --insecure https://puppet/deploy/getcert.cgi > certs.sh
#then re-start puppetize.sh
```

During the puppet first run, I get a timeout on ssh to the scl3 puppet master:
```
root@releng-puppet1.srv.releng.mdc2.mozilla.com (Cron Daemon) wrote:

> ssh: connect to host releng-puppet2.srv.releng.scl3.mozilla.com port 22: Connection timed out
> rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
> rsync error: unexplained error (code 255) at io.c(600) [receiver=3.0.6]
```
These steps allow the puppetize to continue.
(In reply to Dave House [:dhouse] from comment #7)
> During the puppet first run, I get a timeout on ssh to the scl3 puppet
> master:
> ```
> root@releng-puppet1.srv.releng.mdc2.mozilla.com (Cron Daemon) wrote:
> 
> > ssh: connect to host releng-puppet2.srv.releng.scl3.mozilla.com port 22: Connection timed out
> > rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
> > rsync error: unexplained error (code 255) at io.c(600) [receiver=3.0.6]
> ```

so rsync is failing from scl3. I may change it to mdc1 temporarily

> These steps allow the puppetize to continue.

The previous CNAME and the curl of certs.sh are what allowed the puppetize to continue.
I've re-kickstarted both releng-puppet1 and releng-puppet2 in mdc2 so that they are continually retrying to puppetize (cert first) against releng-puppet2.srv.releng.scl3 (every 60 seconds until success).
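For reference, the retry-until-success behaviour described here can be sketched as a small shell loop. This is not the actual puppetize script; `retry_until_success` is a hypothetical helper, and the commented-out invocation is an assumption about how it might be called.

```shell
#!/bin/sh
# retry_until_success INTERVAL COMMAND [ARGS...]
# Run COMMAND repeatedly, sleeping INTERVAL seconds after each failure,
# and report how many attempts it took once the command succeeds.
retry_until_success() {
    interval=$1
    shift
    attempts=1
    until "$@"; do
        attempts=$((attempts + 1))
        sleep "$interval"
    done
    echo "succeeded after $attempts attempt(s)"
}

# Hypothetical use with the 60 second interval mentioned in this comment:
#   retry_until_success 60 ./puppetize.sh
# Harmless demo with a command that succeeds immediately:
retry_until_success 0 true
```

The loop has no upper bound on attempts, which matches the "until success" behaviour described above but means a permanently blocked flow keeps it retrying forever.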
Depends on: 1441248
This is on hold until VMware infrastructure is racked, powered on, and available in MDC2.
Depends on: 1443286
Depends on: 1443288
Depends on: 1443289
Depends on: 1443291
Blocks: 1443286
No longer depends on: 1443286
Blocks: 1443291
No longer depends on: 1443291
Blocks: 1443289
No longer depends on: 1443289
Blocks: 1443288
No longer depends on: 1443288
Blocks: 1443306
Blocks: 1443307
Blocks: 1443308
Assignee: dhouse → jwatkins
Comment on attachment 8958522 [details] [diff] [review]
bug1440062_update_moco_config_mdc2.patch

Review of attachment 8958522 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #8958522 - Flags: review?(klibby) → review+
The /data partition and LVM are set up on both masters.  I'm currently rsyncing /data from releng-puppet2.srv.releng.scl3 to releng-puppet2.srv.releng.mdc2.  From there I'll rsync /data between the two mdc2 puppet masters, which should be much faster.
Both puppet masters are up and running.  The /data partitions have been fully synced, and I've added CNAMEs for puppet and repos to test, wintest, and srv.  I've also added the new puppet masters to the puppetagain-apt A records.

I was able to successfully reimage t-yosemite-r7-235 and get it to puppetize, so we should be just about ready to go on reimaging the yosemite mac minis.  I just need to double check the default deploystudio group before green lighting :van on the rest of the minis.

When I attempted to kickstart the log aggregators, I ran into the issue :dhouse ran into in c#1, c#2, and c#3.  The URL does not fetch the kickstart profile.  It seems Dave was able to use the VIPs as a workaround, but this will need to be troubleshot and fixed ASAP.  AFAIK, this worked fine when we kickstarted all of releng in MDC1, so why does it not work now?  I'm fairly sure this is a netops/firewall issue, since I wasn't able to see any incoming HTTP requests using tcpdump on the admin host that serves the kickstart profile.  The traffic is being blocked somewhere in between.
Took a look in panorama as well as the admin1a.private.mdc2 logs.  I see the initial pxe boot work in the admin1a logs, the fetching of the pxe boot and needed bootstrap files.  I do *not* see any attempt at the fetching of the kickstart file.  (which is to say "I see the problem you describe").  The request never seems to reach the admin server.

Looking at Panorama - I had thought we were logging everything, accepts and denies, so I would expect to see entries in the log for the traffic of the fetching of pxe boot - but I don't.  I don't see anything other than a couple of pings being done ~10 hours ago.  So, I'm in the land of "either my assumptions about logging are off, or something is just not right here."

I'm searching in both cases for entries mentioning log-aggregator1's IP of 10.51.48.60.

In the interests of getting some eyes on this - NI'ing rmfd, as he's our netops resource for this project.
Flags: needinfo?(dmurphy)
I was able to replicate this from the puppet master in mdc2, since I wasn't able to break out into a shell from the CentOS installer on the log aggregator.  Using curl on the master and tcpdump on the admin hosts shows no traffic.  cknowles was kind enough to check the panorama logs and screenshot the denies.  See attached picture.
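A capture filter along these lines is one way to watch for the missing requests on the admin host. This is a sketch only: `http_filter` is a hypothetical helper, the source IP is log-aggregator1's address from the earlier comment, and the exact tcpdump invocation used in this bug was not recorded.

```shell
#!/bin/sh
# http_filter SRC_IP [PORT]: print a pcap filter expression matching TCP
# traffic from SRC_IP to the given destination port (default 80).
http_filter() {
    printf 'src host %s and tcp dst port %s\n' "$1" "${2:-80}"
}

# On the admin host (needs root), something like:
#   tcpdump -nn -i any "$(http_filter 10.51.48.60)"
# while the client retries the kickstart fetch with curl.
http_filter 10.51.48.60
```

If nothing shows up while the client is retrying, the request is being dropped before it reaches the admin host, which is consistent with the denies in the screenshot.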
Looking further at the rules - I can't see anywhere where the admin host (10.[48,50].75.31) is allowed to HTTP things to the rest of things.  However, the VIPs are totally allowed, per the rules, to do so.  (10.[48,50].122.5).

So, two paths I can see - either we go with how things are set in the firewall currently and use the VIPs, *or* we shift the rules that mention the VIPs to mention the direct addresses as well.
(In reply to Chris Knowles [:cknowles] from comment #17)
 
> So, two paths I can see - either we go with how things are set in the
> firewall currently and use the VIPs, *or* we shift the rules that mention
> the VIPs to mention the direct addresses as well.

I agree with using the VIPs as the 'right' way of fixing this.
Alright, in looking at hiera/datacenter/mdc2.yaml there's a line:

# Update to admin1.vips.private.mdc2 once Bug 1428843 is resolved
pxe_ks_listen_ip: '10.50.75.31'

That bug is resolved, so updated to 10.50.122.5 and committed in abc7d820ec0c32c4071bcaf3ee7d1fc393abd723

Let me know if I can do anything else.
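For anyone repeating this in another datacenter file, the one-line change can be sketched with sed. The `set_pxe_ks_listen_ip` helper is hypothetical, it works on a scratch copy rather than the real hiera tree, and the actual commit may have touched the comment line as well.

```shell
#!/bin/sh
# set_pxe_ks_listen_ip FILE IP: print FILE with the pxe_ks_listen_ip value
# replaced by IP (single-quoted, as in the existing hiera data).
set_pxe_ks_listen_ip() {
    sed -E "s/^(pxe_ks_listen_ip:[[:space:]]*).*/\1'$2'/" "$1"
}

# Scratch file holding the two lines quoted above.
scratch=$(mktemp)
cat > "$scratch" <<'EOF'
# Update to admin1.vips.private.mdc2 once Bug 1428843 is resolved
pxe_ks_listen_ip: '10.50.75.31'
EOF
set_pxe_ks_listen_ip "$scratch" 10.50.122.5
rm -f "$scratch"
```

The helper prints the result to stdout instead of editing in place, so the output can be reviewed before it replaces the original file.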
(In reply to Chris Knowles [:cknowles] from comment #19)
> Alright, in looking at hiera/datacenter/mdc2.yaml there's a line:
> 
> # Update to admin1.vips.private.mdc2 once Bug 1428843 is resolved
> pxe_ks_listen_ip: '10.50.75.31'
> 
> That bug is resolved, so updated to 10.50.122.5 and committed in
> abc7d820ec0c32c4071bcaf3ee7d1fc393abd723
> 
> Let me know if I can do anything else.

Thanks! Puppet has propagated and kickstarts are working in MDC2.

<clearing NI on dmurphy>
Flags: needinfo?(dmurphy)
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED