Closed
Bug 1130176
Opened 9 years ago
Closed 9 years ago
Re-create AWS buildbot masters with CentOS 6.5
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: dustin)
References
Details
Attachments
(3 files)
990 bytes,
patch
|
massimo
:
review+
dustin
:
checked-in+
|
Details | Diff | Splinter Review |
19.18 KB,
patch
|
arich
:
review+
dustin
:
checked-in+
|
Details | Diff | Splinter Review |
52 bytes,
text/x-github-pull-request
|
rail
:
review+
|
Details | Review |
First, I'll update the instance definition to use the new CentOS 6.5 hvm-base AMI. Then, I'll start creating new instances in the BB VLAN, complete with puppetization under the new hostname. Once the old host finishes its graceful stop, I'll start up the new master. This will require some inventory changes, updates to production-masters.json, and updates to puppet. It'd be a little easier to reimage in place, but then we'll still be in the srv VLAN.
Assignee
Comment 1 • 9 years ago
I built a temporary buildbot-master999 to test this out.
Assignee
Comment 2 • 9 years ago
My test master was built with an updated 'buildbot-master' profile, differing from server-linux64 only in subnet/domain, instance_type, moz-type, and instance_profile_name. The result looks like:

[root@buildbot-master999.bb.releng.use1.mozilla.com ~]# df -h
Filesystem                      Size  Used Avail Use% Mounted on
/dev/mapper/cloud_root-lv_root   35G  5.6G   28G  18% /
none                            1.9G     0  1.9G   0% /dev/shm
/dev/xvda1                       60M   44M   13M  79% /boot

while a "real" and reasonably recent master looks like:

[root@buildbot-master76.srv.releng.use1.mozilla.com ~]# df -h
Filesystem                            Size  Used Avail Use% Mounted on
/dev/xvde1                             15G  7.0G  7.1G  50% /
none                                  1.9G  236K  1.9G   1% /dev/shm
/dev/mapper/instances_storage-builds  4.0G  4.0G     0 100% /mnt/instance_storage

That last bit is swap on ephemeral storage, via LVM for reasons I can't fathom:

[root@buildbot-master76.srv.releng.use1.mozilla.com ~]# cat /proc/swaps
Filename                         Type  Size     Used   Priority
/mnt/instance_storage/swap_file  file  4046652  13404  -1
[root@buildbot-master76.srv.releng.use1.mozilla.com ~]# pvs
  PV         VG                Fmt  Attr PSize PFree
  /dev/xvdf  instances_storage lvm2 a--  3.99g    0

From what I understand, the extra 20G on the root volume won't hurt anything (and changing it would require baking a new base image). I need to figure out how the swap is set up and replicate it here, but otherwise this looks ready to roll.

In other news, per email we're good to go on the move to the BB VLAN. It will involve some additional steps, but I'll work those out today.
Assignee
Comment 3 • 9 years ago
The add_swap init script looks for the first mounted device under /mnt or /builds (the "known SSD mount points"). So the question is: why is the instance storage not mounted there? Something cloud-init related?
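As a concrete illustration, the detection described above (take the first mounted filesystem under one of the known SSD mount points) can be sketched in shell. This is a hypothetical reconstruction, not the actual add_swap source; `find_ssd_mount` and the parameterized mount-table path are my own names:

```shell
# find_ssd_mount: print the first mounted filesystem under /mnt or
# /builds, reading a /proc/mounts-style table (device mountpoint ...).
# The table path is a parameter so the logic can be tested against a
# fake file; the real script would presumably read /proc/mounts.
find_ssd_mount() {
  local mtab="${1:-/proc/mounts}"
  awk '$2 ~ /^\/(mnt|builds)(\/|$)/ { print $2; exit }' "$mtab"
}
```

If this prints nothing, the ephemeral volume never got mounted at a known location, which matches the symptom in this comment.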
Assignee
Comment 4 • 9 years ago
Rail and I discussed options here. There are a few moving parts:

* aws_create_instance creates a BlockDeviceMapping based on `device_map` in `config/buildbot-master`; for EBS-backed instances (which these are), this is required to get access to ephemeral volumes.
* On startup, cloud-init reads instance metadata generated from `config/buildbot-master.cloud-init`. cloud-init can mount things.
* There's puppet code to add /etc/init.d/add_swap, which looks for a volume mounted at /mnt or /boot and makes a swap file on it.

It looks like the existing AWS masters had some of this set up by hand -- we're not sure how. So the question is how best to re-implement it. The option we seem to have landed on is to configure only one ephemeral drive (these are m3.mediums, so they only have one 4GB SSD anyway) in `device_map`, then convince cloud-init to mount that at /mnt. At that point, add_swap will find and use the swap. This didn't work:

mounts:
 - [ ephemeral0, none, swap ]

I don't see anything about mounting in the console logs, either.
Assignee
Comment 5 • 9 years ago
The device_mapping bit is working fine -- /dev/xvdb shows up in lsblk, but it's unused. Omitting 'mounts' entirely (which in theory defaults to mounting ephemeral storage at /mnt) didn't work either. Trying (straight from the examples):

mounts:
 - [ ephemeral0, /mnt, auto, "defaults,noexec" ]

didn't work, either. But that data point may be invalid -- it seems that cloud-init only runs the 'mounts' module once per instance, so I've really only tested [ ephemeral0, none, swap ].
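Since the 'mounts' module only runs once per instance, repeated experiments need its per-instance marker cleared first. A hedged sketch, assuming the cloud-init 0.7.x on-disk layout (semaphore files named config_&lt;module&gt; under the instance directory); `reset_ci_module` is a name I made up:

```shell
# reset_ci_module: delete cloud-init's per-instance semaphore for a
# config module so that module is treated as not-yet-run on next boot.
# The base directory is parameterized for testing; on a real host it
# would be /var/lib/cloud/instance (cloud-init 0.7.x layout assumed).
reset_ci_module() {
  local module="$1" base="${2:-/var/lib/cloud/instance}"
  rm -f "$base/sem/config_$module"
}
```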
Assignee
Comment 6 • 9 years ago
Re-creating with 'mounts' entirely omitted didn't work either. Running `/etc/init.d/cloud-init start` said:

Starting cloud-init: /usr/lib/python2.6/site-packages/cloudinit/url_helper.py:40: UserWarning: Module backports was already imported from /usr/lib64/python2.6/site-packages/backports/__init__.pyc, but /usr/lib/python2.6/site-packages is being added to sys.path
  import pkg_resources
Cloud-init v. 0.7.4 running 'init' at Fri, 06 Feb 2015 23:57:59 +0000. Up 284.46 seconds.
ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
ci-info: +--------+------+--------------+-----------------+-------------------+
ci-info: | Device |  Up  |   Address    |       Mask      |     Hw-Address    |
ci-info: +--------+------+--------------+-----------------+-------------------+
ci-info: |   lo   | True |  127.0.0.1   |    255.0.0.0    |         .         |
ci-info: |  eth0  | True | 10.134.68.16 | 255.255.255.192 | 0e:39:52:1d:2f:eb |
ci-info: +--------+------+--------------+-----------------+-------------------+
ci-info: ++++++++++++++++++++++++++++++++Route info+++++++++++++++++++++++++++++++++
ci-info: +-------+-------------+-------------+-----------------+-----------+-------+
ci-info: | Route | Destination |   Gateway   |     Genmask     | Interface | Flags |
ci-info: +-------+-------------+-------------+-----------------+-----------+-------+
ci-info: |   0   | 10.134.68.0 |   0.0.0.0   | 255.255.255.192 |    eth0   |   U   |
ci-info: |   1   | 169.254.0.0 |   0.0.0.0   |   255.255.0.0   |    eth0   |   U   |
ci-info: |   2   |   0.0.0.0   | 10.134.68.1 |     0.0.0.0     |    eth0   |   UG  |
ci-info: +-------+-------------+-------------+-----------------+-----------+-------+
2015-02-06 18:58:01,045 - util.py[WARNING]: Failed to disable password for user cloud-user
2015-02-06 18:58:01,046 - util.py[WARNING]: Running users-groups (<module 'cloudinit.config.cc_users_groups' from '/usr/lib/python2.6/site-packages/cloudinit/config/cc_users_groups.pyc'>) failed

cloud-init goes out of its way to make sure its logging is black-holed, unfortunately -- it closes stdout and frequently resets Python's logging library so as to silence output.
Assignee
Comment 7 • 9 years ago
This seems to work, after some manual futzing:

mounts:
 - [ ephemeral0, /mnt, auto, "defaults,noexec" ]

But it turns out that the 'mounts' module runs in the 'config' mode, while /etc/init.d/cloud-init start only runs the 'init' mode. I'm going to try again with a fresh AMI creation.
Assignee
Comment 8 • 9 years ago
Well, that sorta worked, but add_swap still created the swap volume on the root disk:

[root@buildbot-master999.bb.releng.use1.mozilla.com ~]# cat /etc/fstab
LABEL=root_dev / ext4 defaults,noatime,nodiratime,commit=60 1 1
/dev/xvda1 /boot ext2 rw 0 0
none /proc proc defaults 0 0
none /sys sysfs defaults 0 0
none /dev/pts devpts gid=5,mode=620 0 0
none /dev/shm tmpfs defaults 0 0
/dev/xvdb /mnt auto defaults,noexec,comment=cloudconfig 0 2
[root@buildbot-master999.bb.releng.use1.mozilla.com ~]# swapon -s
Filename    Type  Size     Used  Priority
/swap_file  file  4194296  0     -1
Assignee
Comment 9 • 9 years ago
So part of that is caused by add_swap searching for ext4, when this filesystem is ext3. Another part is that the filesystem has only 3.7GB available, while add_swap tries to allocate a 4GB swap file.

[root@buildbot-master999.bb.releng.use1.mozilla.com ~]# lsblk
NAME                            MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda                            202:0    0  35G  0 disk
├─xvda1                         202:1    0  61M  0 part /boot
└─xvda2                         202:2    0  35G  0 part
  └─cloud_root-lv_root (dm-0)   253:0    0  35G  0 lvm  /
xvdb                            202:16   0   4G  0 disk /mnt
[root@buildbot-master999.bb.releng.use1.mozilla.com ~]# df -h /mnt
Filesystem  Size  Used Avail Use% Mounted on
/dev/xvdb   4.0G   73M  3.7G   2% /mnt

whereas:

[root@buildbot-master71.srv.releng.use1.mozilla.com ~]# lsblk
NAME                              MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvde1                             202:65   0  15G  0 disk /
xvdf                              202:80   0   4G  0 disk
└─instances_storage-builds (dm-0) 253:0    0   4G  0 lvm  /mnt/instance_storage
[root@buildbot-master71.srv.releng.use1.mozilla.com ~]# pvs
  PV         VG                Fmt  Attr PSize PFree
  /dev/xvdf  instances_storage lvm2 a--  3.99g    0
[root@buildbot-master71.srv.releng.use1.mozilla.com ~]# df -h /mnt/instance_storage/
Filesystem                            Size  Used Avail Use% Mounted on
/dev/mapper/instances_storage-builds  4.0G  4.0G     0 100% /mnt/instance_storage

I don't know how the old setup allocates a full 4GB file on a 3.99g disk with room for filesystem overhead. Well, actually I do -- it cheats:

[root@buildbot-master71.srv.releng.use1.mozilla.com ~]# ls -al /mnt/instance_storage/swap_file
-rw------- 1 root root 4143779840 Sep 30 00:26 /mnt/instance_storage/swap_file
                       ^^^^^^^^^^ not quite 4G!

So I think I'll try to figure out how much free space is on the disk and fallocate that.
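The fix sketched at the end of the comment (size the swap file to the free space actually available) comes down to a margin calculation. A minimal sketch; `swap_margin_kb` and the 95% figure are my assumptions, not what add_swap or the hand-built masters actually use:

```shell
# swap_margin_kb: given available space in 1K blocks (as reported by
# `df -P`, field 4), return a swap-file size that leaves ~5% headroom
# for filesystem overhead.
swap_margin_kb() {
  echo $(( $1 * 95 / 100 ))
}

# Illustrative usage (not run here):
#   avail=$(df -P /mnt | awk 'NR==2 { print $4 }')
#   fallocate -l "$(swap_margin_kb "$avail")K" /mnt/swap_file
#   mkswap /mnt/swap_file && swapon /mnt/swap_file
```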
Assignee
Comment 10 • 9 years ago
I switched things around to follow http://blog.brianbeach.com/2014/12/configuring-linux-swap-device-with.html:

mounts:
 - [ ephemeral0, none, swap, sw, 0, 0 ]
bootcmd:
 - mkswap /dev/xvdb
 - swapon /dev/xvdb

since that seems a lot simpler. However, add_swap will still happily create a swap file on the root partition, even though there's already swap present. I don't really want to mess with the builders, so if possible I'll make a puppet patch to only install add_swap on builders (and to remove it under toplevel::server), and just rely on the approach above. If it works.
Assignee
Comment 11 • 9 years ago
Yup:

[root@buildbot-master999.bb.releng.use1.mozilla.com ~]# swapon -s
Filename    Type       Size     Used  Priority
/dev/xvdb   partition  4188664  0     -1
/swap_file  file       4194296  0     -2
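With two swap sources active, the kernel drains the higher-priority one first (here /dev/xvdb at -1 ahead of /swap_file at -2). A small sketch for pulling the winning entry out of a /proc/swaps-style table; `top_swap` and the parameterized path are for illustration only:

```shell
# top_swap: print the swap source with the highest priority from a
# /proc/swaps-style table (one header line, then: name type size used
# priority). Path parameterized so it can be tested against a capture.
top_swap() {
  local swaps="${1:-/proc/swaps}"
  tail -n +2 "$swaps" | sort -k5,5nr | awk 'NR==1 { print $1 }'
}
```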
Assignee
Comment 12 • 9 years ago
Even easier than I thought. We'll shortly be reimaging all AWS masters, so there's no need here to ensure absent.
Attachment #8561407 - Flags: review?(mgervasini)
Comment 13 • 9 years ago
Comment on attachment 8561407 [details] [diff] [review]
bug1130176.patch

looks good to me, thanks Dustin.
Attachment #8561407 - Flags: review?(mgervasini) → review+
Assignee
Comment 14 • 9 years ago
This adds a new node definition, in the bb VLAN, for each AWS master in the srv VLAN (so, all but two).
Attachment #8561417 - Flags: review?(arich)
Assignee
Comment 15 • 9 years ago
Comment on attachment 8561407 [details] [diff] [review]
bug1130176.patch

remote:   https://hg.mozilla.org/build/puppet/rev/8dd49c5a3a46
remote:   https://hg.mozilla.org/build/puppet/rev/2ddf2055730b
Attachment #8561407 - Flags: checked-in+
Updated • 9 years ago
Attachment #8561417 - Flags: review?(arich) → review+
Assignee
Comment 16 • 9 years ago
Comment on attachment 8561417 [details] [diff] [review]
bug1130176-new-masters.patch

remote:   https://hg.mozilla.org/build/puppet/rev/6b347b11741f
remote:   https://hg.mozilla.org/build/puppet/rev/0d591a07e9d0
Attachment #8561417 - Flags: checked-in+
Assignee
Comment 17 • 9 years ago
I figure I should get this committed, since it's in production.
Attachment #8562928 - Flags: review?(rail)
Updated • 9 years ago
Attachment #8562928 - Flags: review?(rail) → review+
Assignee
Comment 18 • 9 years ago
Done -- just cleaning up on the parent bug now.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED