Closed Bug 1440062 Opened 6 years ago Closed 6 years ago

[MDC2] kickstart releng-puppet1/2 in MDC2

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: dividehex, Assigned: dividehex)

Attachments

(3 files, 1 obsolete file)

Screenshot - 02212018 - 02:04:35 PM _ kickstart cfg failure mdc2.png 6 years ago :dhouse 37.36 KB, image/png		Details
Screenshot - 02212018 - 02:04:35 PM _ kickstart cfg failure mdc2.png 6 years ago :dhouse 36.25 KB, image/png		Details
bug1440062_update_moco_config_mdc2.patch 6 years ago Jake Watkins [:dividehex] 14.35 KB, patch	fubar : review+	Details \| Diff \| Splinter Review
14fe5934-840b-48ad-bd60-fc299da4b739.png 6 years ago Jake Watkins [:dividehex] 128.04 KB, image/png		Details

Jake Watkins [:dividehex]

Assignee

Description

•

6 years ago

Before we can kickstart the rest of the VMs in mdc2, we need the puppet environment up and running.

:dhouse

Comment 1

•

6 years ago

Attached image Screenshot - 02212018 - 02:04:35 PM _ kickstart cfg failure mdc2.png (obsolete) — Details

I'm getting a failure downloading the kickstart cfg file in mdc2. We had the same problem in mdc1, where the cause was a missing network flow for HTTP to admin1a from [srv,test].releng.mdc1.

:dhouse

Updated

•

6 years ago

Assignee: relops → dhouse

:dhouse

Comment 2

•

6 years ago

Attached image Screenshot - 02212018 - 02:04:35 PM _ kickstart cfg failure mdc2.png — Details

I also tested reaching over to the mdc1 admin server for the kickstart, which failed as well; that attempt is what my previous screenshot showed. I'm replacing it with a screenshot of the correct mdc2 admin connection failure.

Attachment #8952848 - Attachment is obsolete: true

:dhouse

Comment 3

•

6 years ago

Changing the kickstart cfg url in the grub works.
http://admin1.vips.private.mdc2.mozilla.com/kickstart/profiles/pa-c65-64-vmware.cfg
instead of:
http://10.50.75.31/kickstart/profiles/pa-c65-64-vmware.cfg
After that, the CentOS install needs to download the 6.5 image, but it has nowhere to download it from (it cannot reach the scl3 or mdc1 puppet masters over port 80/HTTP). So I checked the official CentOS 6.5 mirror and I'm using that for releng-puppet2, although the two are not exactly the same image. My plan is to image puppet1 from there first; then, once puppet2 is built from the correct image on puppet1, I can re-image puppet1 from puppet2.
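The URL swap described above amounts to a one-line text substitution. This is only a sketch: it assumes the URL reaches the installer as a `ks=` boot argument, which this comment does not state, and the function name and sample kernel line are hypothetical.

```shell
# Hypothetical helper: swap the direct admin IP for the VIP hostname in a
# kickstart URL. The IP and hostname are the ones quoted in this comment.
rewrite_ks_url() {
    # $1: text containing the kickstart URL (for example a grub kernel line)
    printf '%s\n' "$1" |
        sed 's|http://10\.50\.75\.31/|http://admin1.vips.private.mdc2.mozilla.com/|'
}

# Example: the sample kernel line is an assumption, not copied from the installer.
rewrite_ks_url 'kernel vmlinuz ks=http://10.50.75.31/kickstart/profiles/pa-c65-64-vmware.cfg'
```

Only the host part changes; the profile path stays the same.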

:dhouse

Comment 4

•

6 years ago

http://mirrors.kernel.org/centos/6/os/x86_64/images/install.img
No need to diff; they are different sizes:
```
-rw-r--r-- 1 puppetsync puppetsync 144060416 Nov 29  2013 /data/repos/yum/mirrors/centos/6.5/os/x86_64/images/install.img
-rw-rw-r-- 1 dhouse     dhouse     146558976 Mar 28  2017 ./install.img
```
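A size check like the one in the listing can be scripted as a quick first test. This is a minimal sketch with a hypothetical function name; matching sizes would not prove the images are identical, so `cmp` or a checksum would still be needed in that case.

```shell
# Hypothetical helper: report whether two files have the same byte count.
same_size() {
    # wc -c reads from stdin so only the count is printed; $(( )) strips
    # any padding that some wc implementations add.
    size_a=$(( $(wc -c < "$1") ))
    size_b=$(( $(wc -c < "$2") ))
    echo "$1: $size_a bytes"
    echo "$2: $size_b bytes"
    [ "$size_a" -eq "$size_b" ]
}

# Usage (paths from the listing above; adjust to where the images live):
#   same_size /data/repos/yum/mirrors/centos/6.5/os/x86_64/images/install.img ./install.img \
#       || echo "images differ"
```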

:dhouse

Comment 5

•

6 years ago

I requested network flows to allow HTTP, HTTPS, and the puppet ports to the puppet masters from all mdc2 releng vlans (srv, relabs, test, wintest).

:dhouse

Comment 6

•

6 years ago

Some flows were fixed in bug 1440157.
I am kickstarting releng-puppet1 again. I changed the kickstart.cfg hostname to use admin1.vips, and created a "repos" CNAME pointing at releng-puppet1.srv.releng.mdc1 (the HTTP connection to the scl3 puppetmasters still fails).

:dhouse

Comment 7

•

6 years ago

I created the CNAME "puppet" so that the puppetize script can reach over to the mdc1 puppet master.
Also, I manually pulled certs.sh because the automatic fetch fails (perhaps python urllib2 is failing on SSL).
```
curl --user user:pass --insecure https://puppet/deploy/getcert.cgi > certs.sh
#then re-start puppetize.sh
```

During the puppet first run, I get a timeout on ssh to the scl3 puppet master:
```
root@releng-puppet1.srv.releng.mdc2.mozilla.com (Cron Daemon) wrote:

> ssh: connect to host releng-puppet2.srv.releng.scl3.mozilla.com port 22: Connection timed out
> rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
> rsync error: unexplained error (code 255) at io.c(600) [receiver=3.0.6]
```
These steps allow the puppetize to continue.

:dhouse

Comment 8

•

6 years ago

(In reply to Dave House [:dhouse] from comment #7)
> During the puppet first run, I get a timeout on ssh to the scl3 puppet
> master:
> ```
> root@releng-puppet1.srv.releng.mdc2.mozilla.com (Cron Daemon) wrote:
> 
> > ssh: connect to host releng-puppet2.srv.releng.scl3.mozilla.com port 22: Connection timed out
> > rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
> > rsync error: unexplained error (code 255) at io.c(600) [receiver=3.0.6]
> ```

So the rsync from scl3 is failing. I may point it at mdc1 temporarily.

> These steps allow the puppetize to continue.

To clarify: the CNAME and the manual curl of certs.sh described earlier are what allowed puppetize to continue.

:dhouse

Comment 9

•

6 years ago

I've re-kickstarted both releng-puppet1 and releng-puppet2 in mdc2 so that they are continually retrying to puppetize (cert first) against releng-puppet2.srv.releng.scl3 (every 60 seconds until success).
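The retry behaviour described here (try again every 60 seconds until it works) can be illustrated with a generic loop. This is a sketch only, not the logic the kickstart or puppetize scripts actually use; the function name is hypothetical.

```shell
# Hypothetical helper: rerun a command until it exits 0, sleeping between tries.
retry_until_success() {
    # $1: seconds to wait between attempts; remaining args: the command to run
    retry_interval=$1
    shift
    retry_attempts=1
    until "$@"; do
        echo "attempt $retry_attempts failed; retrying in ${retry_interval}s" >&2
        retry_attempts=$((retry_attempts + 1))
        sleep "$retry_interval"
    done
    echo "succeeded after $retry_attempts attempt(s)"
}

# Usage (puppetize.sh is the script named in comment 7; its path is not given here):
#   retry_until_success 60 ./puppetize.sh
```

A loop like this never gives up, which suits a host that cannot do anything useful until it has its cert.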

Jake Watkins [:dividehex]

Assignee

Updated

•

6 years ago

Depends on: 1441248

Jake Watkins [:dividehex]

Assignee

Comment 10

•

6 years ago

This is on hold until VMware infrastructure is racked, powered on, and available in MDC2.

Jake Watkins [:dividehex]

Assignee

Updated

•

6 years ago

Depends on: 1443286

Jake Watkins [:dividehex]

Assignee

Updated

•

6 years ago

Depends on: 1443288

Jake Watkins [:dividehex]

Assignee

Updated

•

6 years ago

Depends on: 1443289

Jake Watkins [:dividehex]

Assignee

Updated

•

6 years ago

Depends on: 1443291

Jake Watkins [:dividehex]

Assignee

Updated

•

6 years ago

Blocks: 1443286

No longer depends on: 1443286

Jake Watkins [:dividehex]

Assignee

Updated

•

6 years ago

Blocks: 1443291

No longer depends on: 1443291

Jake Watkins [:dividehex]

Assignee

Updated

•

6 years ago

Blocks: 1443289

No longer depends on: 1443289

Jake Watkins [:dividehex]

Assignee

Updated

•

6 years ago

Blocks: 1443288

No longer depends on: 1443288

Jake Watkins [:dividehex]

Assignee

Updated

•

6 years ago

Blocks: 1443306

Jake Watkins [:dividehex]

Assignee

Updated

•

6 years ago

Blocks: 1443307

Jake Watkins [:dividehex]

Assignee

Updated

•

6 years ago

Blocks: 1443308

Jake Watkins [:dividehex]

Assignee

Updated

•

6 years ago

Assignee: dhouse → jwatkins

Jake Watkins [:dividehex]

Assignee

Comment 11

•

6 years ago

Attached patch bug1440062_update_moco_config_mdc2.patch — Details — Splinter Review

Attachment #8958522 - Flags: review?(klibby)

Kendall Libby [:fubar] (he/him)

Comment 12

•

6 years ago

Comment on attachment 8958522 [details] [diff] [review]
bug1440062_update_moco_config_mdc2.patch

Review of attachment 8958522 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm

Attachment #8958522 - Flags: review?(klibby) → review+

Jake Watkins [:dividehex]

Assignee

Comment 13

•

6 years ago

The /data partition and LVM are set up on both masters. I'm currently rsyncing /data from releng-puppet2.srv.releng.scl3 to releng-puppet2.srv.releng.mdc2. From there I'll rsync /data between the two mdc2 puppetmasters, which should be much faster.
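The two-stage copy can be sketched as a pair of rsync invocations. The function below only prints them. The `-a` flag, the direction each command runs in, and the choice of releng-puppet2 in mdc2 as the host running both are assumptions; the hostnames follow the pattern used elsewhere in this bug.

```shell
# Hypothetical sketch: print the two rsync stages instead of running them.
print_data_sync_plan() {
    scl3_master=releng-puppet2.srv.releng.scl3.mozilla.com
    mdc2_peer=releng-puppet1.srv.releng.mdc2.mozilla.com
    # Stage 1, run on releng-puppet2 in mdc2: pull /data across the WAN once.
    echo "rsync -a ${scl3_master}:/data/ /data/"
    # Stage 2, also run on releng-puppet2 in mdc2: push to the local peer.
    echo "rsync -a /data/ ${mdc2_peer}:/data/"
}

print_data_sync_plan
```

The point of the second stage is that only one copy has to cross the slow link between datacenters.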

Jake Watkins [:dividehex]

Assignee

Comment 14

•

6 years ago

Both puppet masters are up and running. The /data partitions have been fully synced, and I've added CNAMEs for puppet and repos to test, wintest, and srv. I've also added the new puppet masters to the puppetagain-apt A records.
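A quick way to double check the new records is to enumerate the names and resolve each one. In this sketch the `<name>.<vlan>.releng.mdc2.mozilla.com` pattern is an assumption based on hostnames seen elsewhere in this bug, and the function name is hypothetical.

```shell
# Hypothetical sketch: list the CNAMEs mentioned above for each mdc2 releng vlan.
list_expected_cnames() {
    for vlan in test wintest srv; do
        for name in puppet repos; do
            echo "${name}.${vlan}.releng.mdc2.mozilla.com"
        done
    done
}

# Usage: resolve each name and check the target by eye.
#   list_expected_cnames | while read -r fqdn; do
#       echo "== $fqdn"; dig +short "$fqdn" CNAME
#   done
list_expected_cnames
```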

I was able to successfully reimage t-yosemite-r7-235 and get it to puppetize, so we should be just about ready to go on reimaging the yosemite mac minis.  I just need to double check the default deploystudio group before green lighting :van on the rest of the minis.

When I attempted to kickstart the log aggregators, I ran into the issue :dhouse ran into in c#1, c#2, and c#3: the URL does not fetch the kickstart profile. It seems Dave was able to use the VIPs as a workaround, but this will need to be troubleshot and fixed ASAP. AFAIK, this worked fine when we kickstarted all of releng in MDC1, so why does it not work now? I'm fairly sure this is a netops/firewall issue, since I wasn't able to see any incoming HTTP requests using tcpdump on the admin host that serves the kickstart profile. The traffic is being blocked somewhere in between.

Chris Knowles [:cknowles]

Comment 15

•

6 years ago

Took a look in panorama as well as the admin1a.private.mdc2 logs.  I see the initial pxe boot work in the admin1a logs, the fetching of the pxe boot and needed bootstrap files.  I do *not* see any attempt at the fetching of the kickstart file.  (which is to say "I see the problem you describe").  The request never seems to reach the admin server.

Looking at Panorama - I had thought we were logging everything, accepts and denies, so I would expect to see entries in the log for the traffic of the fetching of pxe boot - but I don't.  I don't see anything other than a couple of pings being done ~10 hours ago.  So, I'm in the land of "either my assumptions about logging are off, or something is just not right here."

I'm searching in both cases for entries mentioning log-aggregator1's IP of 10.51.48.60.

In the interests of getting some eyes on this - NI'ing rmfd, as he's our netops resource for this project.

Flags: needinfo?(dmurphy)

Jake Watkins [:dividehex]

Assignee

Comment 16

•

6 years ago

Attached image 14fe5934-840b-48ad-bd60-fc299da4b739.png — Details

I was able to replicate this from the puppet master in mdc2, since I wasn't able to break out into a shell from the CentOS installer on the log aggregator. Using curl on the master and tcpdump on the admin hosts shows no traffic. cknowles was kind enough to check the Panorama logs and screenshot the denies. See the attached picture.
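For anyone repeating this test, the two ends can be sketched as a curl request on the client and a tcpdump filter on the admin host. The function only prints the commands. The `-i any` interface, the 10 second timeout, and the function name are assumptions, and the client IP is a placeholder because the puppet master's address is not given in this bug.

```shell
# Hypothetical sketch: print the pair of commands for the two ends of the test.
print_ks_probe() {
    # $1: client IP to filter on, $2: kickstart URL to request
    echo "on admin host: tcpdump -nn -i any 'tcp dst port 80 and src host $1'"
    echo "on client:     curl -sS --max-time 10 -o /dev/null -w '%{http_code}' '$2'"
}

# <client-ip> is a placeholder; the URL is the direct-IP one from comment 3.
print_ks_probe '<client-ip>' http://10.50.75.31/kickstart/profiles/pa-c65-64-vmware.cfg
```

If curl times out and tcpdump stays silent, the request is being dropped before it reaches the admin host, which matches what was seen here.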

Chris Knowles [:cknowles]

Comment 17

•

6 years ago

Looking further at the rules, I can't find any rule that allows HTTP for the admin hosts' direct addresses (10.[48,50].75.31). The VIPs (10.[48,50].122.5), however, are explicitly allowed per the rules.

So, two paths I can see - either we go with how things are set in the firewall currently and use the VIPs, *or* we shift the rules that mention the VIPs to mention the direct addresses as well.

Jake Watkins [:dividehex]

Assignee

Comment 18

•

6 years ago

(In reply to Chris Knowles [:cknowles] from comment #17)
 
> So, two paths I can see - either we go with how things are set in the
> firewall currently and use the VIPs, *or* we shift the rules that mention
> the VIPs to mention the direct addresses as well.

I agree with using the VIPs as the 'right' way of fixing this.

Chris Knowles [:cknowles]

Comment 19

•

6 years ago

Alright, in looking at hiera/datacenter/mdc2.yaml there's a line:

# Update to admin1.vips.private.mdc2 once Bug 1428843 is resolved
pxe_ks_listen_ip: '10.50.75.31'

That bug is resolved, so updated to 10.50.122.5 and committed in abc7d820ec0c32c4071bcaf3ee7d1fc393abd723

Let me know if I can do anything else.
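For reference, the one-line hiera change can be sketched as a small sed-based helper. This is not how the commit above was made; the function name is hypothetical and the sketch assumes the key appears at the start of a line, as in the snippet quoted in this comment.

```shell
# Hypothetical helper: rewrite the pxe_ks_listen_ip value in a hiera YAML file.
set_pxe_ks_listen_ip() {
    # $1: path to the YAML file, $2: new IP address
    # Edit via a temp file, since in-place flags differ between sed
    # implementations; cat back into place to keep the file's permissions.
    tmp_yaml=$(mktemp) || return 1
    if sed "s|^pxe_ks_listen_ip:.*|pxe_ks_listen_ip: '$2'|" "$1" > "$tmp_yaml"; then
        cat "$tmp_yaml" > "$1"
    fi
    rm -f "$tmp_yaml"
}

# Usage (run from a checkout of the puppet repo; the file path is from this comment):
#   set_pxe_ks_listen_ip hiera/datacenter/mdc2.yaml 10.50.122.5
```

Other lines, including the comment above the key, are left untouched.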

Jake Watkins [:dividehex]

Assignee

Comment 20

•

6 years ago

(In reply to Chris Knowles [:cknowles] from comment #19)
> Alright, in looking at hiera/datacenter/mdc2.yaml there's a line:
> 
> # Update to admin1.vips.private.mdc2 once Bug 1428843 is resolved
> pxe_ks_listen_ip: '10.50.75.31'
> 
> That bug is resolved, so updated to 10.50.122.5 and committed in
> abc7d820ec0c32c4071bcaf3ee7d1fc393abd723
> 
> Let me know if I can do anything else.

Thanks! Puppet has propagated and kickstarts are working in MDC2.

<clearing NI on dmurphy>

Flags: needinfo?(dmurphy)

Jake Watkins [:dividehex]

Assignee

Updated

•

6 years ago

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED
