Closed Bug 1130176 Opened 9 years ago Closed 9 years ago

Re-create AWS buildbot masters with CentOS 6.5

Categories

(Infrastructure & Operations :: RelOps: General, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

Attachments

(3 files)

First, I'll update the instance definition to use the new CentOS 6.5 hvm-base AMI.  Then, I'll start creating new instances in the BB VLAN, complete with puppetization under the new hostname.  Once the old host finishes its graceful stop, I'll start up the new master.

This will require some inventory changes, updates to production-masters.json, and updates to puppet.  It'd be a little easier to reimage in place, but then we'll still be in the srv VLAN.
I built a temporary buildbot-master999 to test this out.
My test master was built with an updated 'buildbot-master' profile, differing from server-linux64 only in subnet/domain, instance_type, moz-type, and instance_profile_name.  The result looks like:

[root@buildbot-master999.bb.releng.use1.mozilla.com ~]# df -h
Filesystem                      Size  Used Avail Use% Mounted on
/dev/mapper/cloud_root-lv_root   35G  5.6G   28G  18% /
none                            1.9G     0  1.9G   0% /dev/shm
/dev/xvda1                       60M   44M   13M  79% /boot

while a "real" and reasonably recent master looks like

[root@buildbot-master76.srv.releng.use1.mozilla.com ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvde1             15G  7.0G  7.1G  50% /
none                  1.9G  236K  1.9G   1% /dev/shm
/dev/mapper/instances_storage-builds
                      4.0G  4.0G     0 100% /mnt/instance_storage

That last bit is swap on ephemeral storage, set up via LVM for reasons I can't fathom:

[root@buildbot-master76.srv.releng.use1.mozilla.com ~]# cat /proc/swaps 
Filename                                Type            Size    Used    Priority
/mnt/instance_storage/swap_file         file            4046652 13404   -1
[root@buildbot-master76.srv.releng.use1.mozilla.com ~]# pvs
  PV         VG                Fmt  Attr PSize PFree
  /dev/xvdf  instances_storage lvm2 a--  3.99g    0 

From what I understand, the extra 20G on the root volume won't hurt anything (and to change it would require baking a new base image).  I need to figure out how the swap is set up and replicate it here, but otherwise this looks ready to roll.

In other news, per email we're good to go on the move to the BB VLAN.  It will involve some additional steps, but I'll work those out today.
The add_swap init script looks for the first mounted device under /mnt or /builds, the "Known SSD mount points".  So the question is, why is the instance storage not mounted there?  Something cloud-init related?
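For context, this is roughly the kind of check add_swap does -- an illustrative sketch, not the actual puppet-managed script:

  #!/bin/bash
  # Illustrative only: find the first mounted device under the "known SSD
  # mount points"; if one is found, that's where the swap file would go.
  for mp in /mnt /builds; do
      dev=$(awk -v mp="$mp" '$2 == mp || index($2, mp "/") == 1 {print $1; exit}' /proc/mounts)
      if [ -n "$dev" ]; then
          echo "instance storage found at $mp ($dev)"
          break
      fi
  done
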
Rail and I discussed options here.  There are a few moving parts:
 * aws_create_instance creates a BlockDeviceMapping based on `device_map` in `config/buildbot-master`; for EBS-backed instances (which these are), this is required to get access to ephemeral volumes.
 * on startup, cloud-init reads instance metadata which is generated from `config/buildbot-master.cloud-init`.  cloud-init can mount things.
 * there's puppet code to add /etc/init.d/add_swap which looks for a volume mounted at /mnt or /builds and makes a swap file on it.

It looks like the existing AWS masters had some of this set up by hand -- we're not sure how.  So the question is, how best to re-implement it.

The option we seem to have landed on is to configure only one ephemeral drive (these are m3.mediums, so they only have one 4GB SSD anyway) in `device_map`, then convince cloud-init to mount that at /mnt.  At that point, add_swap will find and use the swap.
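Once that's wired up, a quick sanity check on a new master would look something like this (hedged; it assumes the single ephemeral volume shows up as /dev/xvdb, as it does on the test master):

  # confirm the ephemeral volume is attached, mounted at /mnt, and carrying swap
  lsblk                        # /dev/xvdb should be listed
  grep ' /mnt ' /proc/mounts   # the cloud-init-managed mount
  swapon -s                    # after add_swap runs, the swap file should live under /mnt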

This didn't work:

  mounts:
   - [ ephemeral0, none, swap ]

I don't see anything about mounting in the console logs, either.
The device_mapping bit is working fine -- /dev/xvdb shows up in lsblk, but it's unused.

Omitting 'mounts' entirely (which in theory defaults to mounting ephemeral storage at /mnt) didn't work either.

Trying (straight from the examples):

  mounts:
   - [ ephemeral0, /mnt, auto, "defaults,noexec" ]

didn't work either.  But that result may be invalid: it seems that cloud-init only runs the 'mounts' module once per instance, so I've really only tested [ ephemeral0, none, swap ].
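For the record, there is a hedged way to check and reset that once-per-instance behavior, based on my understanding of where cloud-init keeps its semaphores (paths may differ on this version):

  # cloud-init records per-instance modules it has already run as semaphore
  # files; if cc_mounts has a semaphore, changing the user-data won't re-run it.
  ls /var/lib/cloud/instance/sem/
  rm -f /var/lib/cloud/instance/sem/config_mounts   # then re-run the config stage
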
Re-creating with 'mounts' entirely omitted didn't work either.  Running `/etc/init.d/cloud-init start` said

Starting cloud-init: /usr/lib/python2.6/site-packages/cloudinit/url_helper.py:40: UserWarning: Module backports was already imported from /usr/lib64/python2.6/site-packages/backports/__init__.pyc, but /usr/lib/python2.6/site-packages is being added to sys.path
  import pkg_resources
Cloud-init v. 0.7.4 running 'init' at Fri, 06 Feb 2015 23:57:59 +0000. Up 284.46 seconds.
ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
ci-info: +--------+------+--------------+-----------------+-------------------+
ci-info: | Device |  Up  |   Address    |       Mask      |     Hw-Address    |
ci-info: +--------+------+--------------+-----------------+-------------------+
ci-info: |   lo   | True |  127.0.0.1   |    255.0.0.0    |         .         |
ci-info: |  eth0  | True | 10.134.68.16 | 255.255.255.192 | 0e:39:52:1d:2f:eb |
ci-info: +--------+------+--------------+-----------------+-------------------+
ci-info: ++++++++++++++++++++++++++++++++Route info+++++++++++++++++++++++++++++++++
ci-info: +-------+-------------+-------------+-----------------+-----------+-------+
ci-info: | Route | Destination |   Gateway   |     Genmask     | Interface | Flags |
ci-info: +-------+-------------+-------------+-----------------+-----------+-------+
ci-info: |   0   | 10.134.68.0 |   0.0.0.0   | 255.255.255.192 |    eth0   |   U   |
ci-info: |   1   | 169.254.0.0 |   0.0.0.0   |   255.255.0.0   |    eth0   |   U   |
ci-info: |   2   |   0.0.0.0   | 10.134.68.1 |     0.0.0.0     |    eth0   |   UG  |
ci-info: +-------+-------------+-------------+-----------------+-----------+-------+
2015-02-06 18:58:01,045 - util.py[WARNING]: Failed to disable password for user cloud-user
2015-02-06 18:58:01,046 - util.py[WARNING]: Running users-groups (<module 'cloudinit.config.cc_users_groups' from '/usr/lib/python2.6/site-packages/cloudinit/config/cc_users_groups.pyc'>) failed

cloud-init goes out of its way to make sure its logging is black-holed, unfortunately -- it closes stdout and frequently resets Python's logging library to silence output.
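The on-disk logs are still useful, though; a hedged note, assuming this AMI uses the stock log locations:

  # stdout is closed, but cloud-init normally still logs to disk
  tail -n 100 /var/log/cloud-init.log
  grep -i mount /var/log/cloud-init.log    # look for what cc_mounts decided to do
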
This seems to work, after some manual futzing:

  mounts:
   - [ ephemeral0, /mnt, auto, "defaults,noexec" ]

But it turns out that the 'mounts' module runs in the 'config' mode, while /etc/init.d/cloud-init start only runs the 'init' mode.  I'm going to try again with a fresh AMI creation.
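If fresh AMI/instance creation gets tedious, the config stage can also be invoked by hand; a hedged sketch of the 0.7.4 CLI as I understand it:

  # /etc/init.d/cloud-init only runs the 'init' stage; the 'mounts' module
  # lives in the 'config' stage, which can be run directly:
  cloud-init modules --mode config
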
Well, that sorta worked, but add_swap still created the swap volume on the root disk:

[root@buildbot-master999.bb.releng.use1.mozilla.com ~]# cat /etc/fstab 
LABEL=root_dev   /         ext4   defaults,noatime,nodiratime,commit=60        1 1
/dev/xvda1 /boot ext2 rw 0 0
none       /proc     proc    defaults        0 0
none       /sys      sysfs   defaults        0 0
none       /dev/pts  devpts  gid=5,mode=620  0 0
none       /dev/shm  tmpfs   defaults        0 0

/dev/xvdb       /mnt    auto    defaults,noexec,comment=cloudconfig     0       2

[root@buildbot-master999.bb.releng.use1.mozilla.com ~]# swapon -s
Filename                                Type            Size    Used    Priority
/swap_file                              file            4194296 0       -1
So part of that is caused by add_swap searching for ext4, when this filesystem is ext3.

Another part of it is that the filesystem has only 3.7 GB available, while add_swap tries to allocate a 4GB swap.

[root@buildbot-master999.bb.releng.use1.mozilla.com ~]# lsblk
NAME                          MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda                          202:0    0  35G  0 disk
├─xvda1                       202:1    0  61M  0 part /boot
└─xvda2                       202:2    0  35G  0 part
  └─cloud_root-lv_root (dm-0) 253:0    0  35G  0 lvm  /
xvdb                          202:16   0   4G  0 disk /mnt
[root@buildbot-master999.bb.releng.use1.mozilla.com ~]# df -h /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvdb       4.0G   73M  3.7G   2% /mnt

whereas

[root@buildbot-master71.srv.releng.use1.mozilla.com ~]# lsblk
NAME                              MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvde1                             202:65   0  15G  0 disk /
xvdf                              202:80   0   4G  0 disk 
└─instances_storage-builds (dm-0) 253:0    0   4G  0 lvm  /mnt/instance_storage
[root@buildbot-master71.srv.releng.use1.mozilla.com ~]# pvs
  PV         VG                Fmt  Attr PSize PFree
  /dev/xvdf  instances_storage lvm2 a--  3.99g    0 
[root@buildbot-master71.srv.releng.use1.mozilla.com ~]# df -h /mnt/instance_storage/
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/instances_storage-builds
                      4.0G  4.0G     0 100% /mnt/instance_storage

I don't know how it allocates a full 4GB file on a 3.99G disk with room for filesystem overhead.  Well, I do, really: it cheats:

[root@buildbot-master71.srv.releng.use1.mozilla.com ~]# ls -al /mnt/instance_storage/swap_file
-rw------- 1 root root 4143779840 Sep 30 00:26 /mnt/instance_storage/swap_file
                       ^^^^^^^^^^ not quite 4G!

So I think I'll try to figure out how much free space is on the disk and fallocate that.
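Something along these lines, perhaps (an illustrative sketch, not the add_swap patch itself; the 64MB margin is arbitrary):

  # size the swap file to what's actually free on the instance storage,
  # minus a small margin, instead of a hard-coded 4GB
  mnt=/mnt/instance_storage                        # wherever the ephemeral volume is mounted
  avail_kb=$(df -Pk "$mnt" | awk 'NR==2 {print $4}')
  swap_kb=$(( avail_kb - 65536 ))                  # leave ~64MB of headroom
  # note: fallocate needs ext4; on ext3 the fallback is dd if=/dev/zero
  fallocate -l $(( swap_kb * 1024 )) "$mnt/swap_file"
  chmod 600 "$mnt/swap_file"
  mkswap "$mnt/swap_file"
  swapon "$mnt/swap_file"
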
I switched things around to follow http://blog.brianbeach.com/2014/12/configuring-linux-swap-device-with.html:

mounts:
  - [ ephemeral0, none, swap, sw, 0, 0 ]
 
bootcmd:
 - mkswap /dev/xvdb
 - swapon /dev/xvdb

since that seems a lot simpler.  However, add_swap will still happily create a swap file on the root partition, even though there's already a swap partition present.  I don't really want to mess with the builders, so I think if possible I'll make a puppet patch to only install add_swap on builders (and to remove it under toplevel::server), and just rely on the approach above.  If it works.
Yup:

[root@buildbot-master999.bb.releng.use1.mozilla.com ~]# swapon -s
Filename                                Type            Size    Used    Priority
/dev/xvdb                               partition       4188664 0       -1
/swap_file                              file            4194296 0       -2
Attached patch bug1130176.patch
Even easier than I thought.  We'll shortly be reimaging all AWS masters, so there's no need here to ensure absent.
Attachment #8561407 - Flags: review?(mgervasini)
Comment on attachment 8561407 [details] [diff] [review]
bug1130176.patch

looks good to me, thanks Dustin.
Attachment #8561407 - Flags: review?(mgervasini) → review+
This adds a new node definition, in the bb VLAN, for each AWS master in the srv VLAN (so, all but two).
Attachment #8561417 - Flags: review?(arich)
Attachment #8561417 - Flags: review?(arich) → review+
Depends on: 1132012
I figure I should get this committed, since it's in production.
Attachment #8562928 - Flags: review?(rail)
Attachment #8562928 - Flags: review?(rail) → review+
Done -- just cleaning up on the parent bug now.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED