Closed Bug 1342518 Opened 7 years ago Closed 5 years ago

Update releng amis to have larger /boot

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: dhouse, Assigned: dhouse)

References

Details

Attachments

(1 file, 1 obsolete file)

60 bytes, text/x-github-pull-request
dividehex: review+
rail: review+
dividehex: checked-in+
We cannot fit two of the recent kernels for CentOS 6.5 into the 60 MB /boot on the current AMIs:

Quoting :arr on #specops, regarding the bug 1330695 kernel update that failed for lack of space:
@arr> ami-4dc07a26 and ami-58246f30 both have tiny /boot, too
Looking over the AMIs:

us-east-1

AMI ID		ami-4dc07a26
snap-7e19a509	50GB	
instances:	pushapkworker-1		t2.micro
		signing-linux-1		t2.micro
		signing-linux-3		t2.micro
AMI Name		centos-65-x86_64-hvm-base-2015-08-28-15-51
314336048151/centos-65-x86_64-hvm-base-2015-08-28-15-51
Creation date	August 28, 2015 at 10:02:27 AM UTC-6
moz-created		1440795748
moz-instance-family	c3
moz-type			base
moz-virtualization-type	hvm

AMI ID		ami-58246f30
snap-1067f19f	50GB	
instances:	signingworker-3		t2.micro
		buildbot-master138	m3.large
		buildbot-master137	m3.large
		buildbot-master69	m3.large
		releng-puppet1		c3.xlarge
		dev-master2		m3.medium
		buildbot-master128	m3.large
		signingworker-1		t2.micro
		buildbot-master01	m3.large
AMI Name		centos-65-x86_64-hvm-base-2015-02-11-20-33
314336048151/centos-65-x86_64-hvm-base-2015-02-11-20-33
Creation date	February 11, 2015 at 1:40:59 PM UTC-7
moz-created		1423705259
moz-instance-family	c3
moz-type			base
moz-virtualization-type	hvm
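For context, a quick way to confirm the problem on one of the affected instances (hedged sketch, standard CentOS 6 tooling):

$ df -h /boot    # the ~60 MB partition leaves no room for a second kernel
$ rpm -q kernel  # installed kernel packages competing for that space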
Assignee: relops → dhouse
Steps for resize (a sketch of attaching the working volumes follows this list):
1. Check/fix the filesystem for / and /boot
2. Create the MBR and /boot partition on the new disk
2a. Partition /boot and set it as bootable
2b. Copy from the old /boot to the new /boot
2c. Copy the MBR to the new disk
3. Move / to the new disk
3a. Create an LVM partition on the new disk
3b. Shrink the current / filesystem to its minimal used size
3c. Add the new LVM volume to the group
3d. Copy the LVM disk to the new volume
3e. Remove the old LVM volume
4. Snapshot the new disk
5. Test a new machine with the new disk snapshot
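The transcript below starts with the old and new disks already attached to a scratch instance as /dev/sdb and /dev/sdc. That attach step is not shown; a hedged AWS CLI sketch of how it can be done (volume, instance, and AZ identifiers are placeholders; only the snapshot ID is real):

$ aws ec2 create-volume --region us-east-1 --availability-zone us-east-1a --snapshot-id snap-7e19a509  # old root disk, from the AMI's snapshot
$ aws ec2 create-volume --region us-east-1 --availability-zone us-east-1a --size 50                    # new, empty 50 GB disk
$ aws ec2 attach-volume --region us-east-1 --volume-id vol-OLD --instance-id i-SCRATCH --device /dev/sdb
$ aws ec2 attach-volume --region us-east-1 --volume-id vol-NEW --instance-id i-SCRATCH --device /dev/sdc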

As performed for the first AMI, snap-7e19a509:
$ e2fsck -f /dev/cloud_root/lv_root 
e2fsck 1.42.12 (29-Aug-2014)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
root_dev: 31402/3276800 files (0.1% non-contiguous), 464763/13090816 blocks

$ dd if=/dev/sdb of=mbr bs=512 count=1  # save the old MBR
$ dd if=mbr of=/dev/sdc bs=446 count=1  # restore only the 446-byte boot code; bytes 446-511 hold the partition table
$ dd if=/dev/sdb1 of=/dev/sdc1  # copy the old /boot filesystem

$ parted -s -a optimal /dev/sdc -- mkpart primary ext2 2048s 512MiB  # new, larger /boot
$ parted /dev/sdc align-check optimal 1
$ parted -s -a optimal /dev/sdc -- mkpart primary ext2 512MiB -1s  # rest of the disk for LVM
$ parted /dev/sdc align-check optimal 2
$ parted -s /dev/sdc -- set 1 boot on
$ parted -s /dev/sdc -- set 2 lvm on
$ mkfs.ext2 /dev/sdc1
$ pvcreate /dev/sdc2
$ vgextend cloud_root /dev/sdc2  # add the new PV to the existing volume group

/dev/sdc1   *        2048     1048575      523264   83  Linux
/dev/sdc2         1048576   104857599    51904512   8e  Linux LVM
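For reference, arithmetic from the listing above: the new /boot spans sectors 2048-1048575, i.e. 1,046,528 sectors x 512 bytes = 511 MiB (523,264 1 KiB blocks), versus roughly 60 MB before.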

$ resize2fs /dev/cloud_root/lv_root -M
resize2fs 1.42.12 (29-Aug-2014)
Resizing the filesystem on /dev/cloud_root/lv_root to 388871 (4k) blocks.
The filesystem on /dev/cloud_root/lv_root is now 388871 (4k) blocks long.

$ pvmove /dev/sdb2
  Insufficient free space: 12784 extents needed, but only 12671 available
  Unable to allocate mirror extents for pvmove0.
  Failed to convert pvmove LV to mirrored
$ bc -l <<< '12671 * 8192'  # 12671 free extents * 8192 (512-byte) sectors per 4 MiB extent
103800832
$ lvreduce --size 103800832s /dev/cloud_root/lv_root 
  WARNING: Reducing active logical volume to 49.50 GiB
  THIS MAY DESTROY YOUR DATA (filesystem etc.)
Do you really want to reduce lv_root? [y/n]: y
  Size of logical volume cloud_root/lv_root changed from 49.50 GiB (12672 extents) to 49.50 GiB (12671 extents).
  Logical volume lv_root successfully resized
$ pvmove /dev/sdb2
  /dev/sdb2: Moved: 0.0%
  /dev/sdb2: Moved: 1.9%
[...]
$ vgreduce cloud_root /dev/sdb2
  Removed "/dev/sdb2" from volume group "cloud_root"
$ pvremove /dev/sdb2
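One step not shown above: after the move, the root filesystem is still at the minimized size from resize2fs -M, so presumably it was grown back to fill the 49.50 GiB logical volume before snapshotting (the second AMI transcript does this explicitly with lvextend --resizefs). A hedged sketch of that step:

$ resize2fs /dev/cloud_root/lv_root  # with no size argument, grows the filesystem to fill the logical volume
$ pvs; vgs; lvs                      # sanity check: only /dev/sdc2 should remain in cloud_root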

Requested snapshot of new disk:
snap-03289d578a0bef4ed
Created Image:
centos-65-x86_64-hvm-base-2017-03-17-16-00 (ami-2c7cd63a)
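The snapshot/registration step itself isn't shown in the transcript; a hedged AWS CLI sketch of how it could be done (vol-NEW is a placeholder and the root device name is an assumption; the snapshot ID and image name are the real ones from above):

$ aws ec2 create-snapshot --region us-east-1 --volume-id vol-NEW
$ aws ec2 register-image --region us-east-1 \
    --name centos-65-x86_64-hvm-base-2017-03-17-16-00 \
    --architecture x86_64 --virtualization-type hvm --root-device-name /dev/sda1 \
    --block-device-mappings "DeviceName=/dev/sda1,Ebs={SnapshotId=snap-03289d578a0bef4ed}"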
Steps for the second AMI were slightly different (the logical volume did not need to be reduced in order to mirror it to the new disk):
e2fsck -f /dev/cloud_root/lv_root 
dd if=/dev/xvdb of=mbr bs=512 count=1
dd if=mbr of=/dev/xvdc bs=446 count=1  # skip partition table, 446-512
dd if=/dev/xvdb1 of=/dev/xvdc1  # copy /boot
parted -s /dev/xvdc -- mklabel msdos
parted -s -a optimal /dev/xvdc -- mkpart primary ext2 2048s 512MiB
parted /dev/xvdc align-check optimal 1
parted -s -a optimal /dev/xvdc -- mkpart primary ext2 512MiB -1s
parted /dev/xvdc align-check optimal 2
parted -s /dev/xvdc -- set 1 boot on
parted -s /dev/xvdc -- set 2 lvm on
mkfs.ext2 /dev/xvdc1
pvcreate /dev/xvdc2
vgextend cloud_root /dev/xvdc2
resize2fs /dev/cloud_root/lv_root -M
pvmove /dev/xvdb2
[...]
vgreduce cloud_root /dev/xvdb2
pvremove /dev/xvdb2
lvextend -l +100%FREE --resizefs /dev/cloud_root/lv_root
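A quick post-lvextend sanity check (hedged; these just confirm the sizes line up before snapshotting):

lvs cloud_root  # lv_root should now span the whole volume group
dumpe2fs -h /dev/cloud_root/lv_root | grep -i 'block count'  # filesystem block count should match the LV size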

Requested snapshot of new disk:
snap-0b1eb8cb621f22757
Created Image:
centos-65-x86_64-hvm-base-2017-03-20-12-00 (ami-3833842e)
Hi Jake, could you tell me how to test/use the AMIs, and whether I need to do the resize differently? For these two AMIs, I expanded the /boot partition to 500 MB and moved the / partition (reduced the LVM group, partition, and filesystem from 49.94 GB to 49.5 GB to keep the image at 50 GB).

I expect I need to change the AMI specified in the build-cloud-tools config for each of the servers and then build new instances of each machine; I am not sure whether that is a good plan, or whether there are files/config not covered by the scripts that I would need to copy and set up manually. I am also wondering whether it would be lower risk (fewer missed configs/files) and faster to resize the disks on each machine in place instead of rebuilding them on the updated AMIs.
Flags: needinfo?(jwatkins)
I was under the impression that the aws_create_ami.py script lets you create AMIs from the ground up, and that that is where the 64 MB /boot limit originates. If that is still the case, then we will need to modify the script [1][2] and rebuild the AMIs with that tool. Once that is done, you can change the various ami_configs to the new AMI IDs.

At that point, we need to decide whether we want to terminate and rebuild each service with aws_create_instance.py, or log in to each instance and manipulate the EBS volumes in place. If an instance is instance-store based (non-hvm), it may still need a terminate/rebuild anyway.

:rail might be a better resource for answering this question.  NI :rail for a sanity check here.

[1] https://github.com/mozilla-releng/build-cloud-tools/blob/0010b72a4690d370ecd4b8714af5f559e8849dbd/cloudtools/scripts/aws_create_ami.py#L37

[2] https://github.com/mozilla-releng/build-cloud-tools/blob/0010b72a4690d370ecd4b8714af5f559e8849dbd/cloudtools/scripts/aws_create_ami.py#L55
Flags: needinfo?(jwatkins) → needinfo?(rail)
I'd just bump https://github.com/mozilla-releng/build-cloud-tools/blob/0010b72a4690d370ecd4b8714af5f559e8849dbd/cloudtools/scripts/aws_create_ami.py#L56 to something like 128 or 256M. It won't help with the existing AMIs/instances though. We'll need to resize them manually or remove the existing kernel before we install the new one.
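For the "remove the existing kernel" option, one common way to free space in /boot on a live CentOS 6 instance (hedged sketch; assumes yum-utils is installed, and check what is there before removing anything):

$ rpm -q kernel                            # list installed kernel packages
$ package-cleanup --oldkernels --count=1   # from yum-utils: keep only the newest kernel, freeing /boot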
Flags: needinfo?(rail)
Attachment #8850650 - Flags: review?(rail)
Attachment #8850650 - Flags: review?(jwatkins)
Attachment #8850650 - Attachment description: github pr → Bug 1342518 - enlarge /boot to 256M
Attachment #8850650 - Flags: review?(rail) → review+
Comment on attachment 8850650 [details] [review]
Bug 1342518 - enlarge /boot to 256M

r+ and merged
Attachment #8850650 - Flags: review?(jwatkins)
Attachment #8850650 - Flags: review+
Attachment #8850650 - Flags: checked-in+
Thanks!
Attached file failed_first_create_ami.log (obsolete)
Hi Rail,

I have some questions about my next step on this. Is it reasonable for me to create all new AMIs? Is there someone or a group who usually does the base AMI builds? Could you suggest a good set to start with, e.g. bld-linux64? (If a critical patch lands before we have rebuilt the services, I'll need to do as you said and "resize them manually or remove the existing kernel before we install the new one".)

It looks like I can do the following (I'm looking at the wiki here, https://wiki.mozilla.org/ReleaseEngineering:AWS#Create_AMI, and the mention here, https://wiki.mozilla.org/ReleaseEngineering/How_To/Work_with_Golden_AMIs#Base_AMI):
(aws_manager)[buildduty@aws-manager1.srv.releng.scl3.mozilla.com aws_manager]$ python /builds/aws_manager/cloud-tools/scripts/aws_create_ami.py --config bld-linux64 --region us-east-1 --secrets /builds/aws_manager/secrets/aws-secrets.json --key-name aws-releng --ssh-key /home/buildduty/.ssh/aws-ssh-key ref-centos-6-x86_64-hvm-base

However, that fails (I'm attaching the instance's system log). Do I need to use a different config to create the base, or do I need to run this from aws-manager2?
Flags: needinfo?(rail)
The procedure sounds correct to me. Can you paste some logs from aws-manager? There is usually a $hostname.log file there.
Flags: needinfo?(rail)
The script doesn't save a log file, but here is the stdout/stderr from when I run it:

(aws_manager)[buildduty@aws-manager1.srv.releng.scl3.mozilla.com aws_manager]$ python /builds/aws_manager/cloud-tools/scripts/aws_create_ami.py --verbose --config bld-linux64 --region us-east-1 --secrets /builds/aws_manager/secrets/aws-secrets.json --key-name aws-releng --ssh-key /home/buildduty/.ssh/aws-ssh-key ref-centos-6-x86_64-hvm-base
INFO:cloudtools.aws.instance:instance Instance:i-0c0f790b65f4265b4 created, waiting to come up
DEBUG:cloudtools.aws:waiting for Instance:i-0c0f790b65f4265b4 availability
INFO:cloudtools.fabric:Using public DNS
[ec2-50-19-57-178.compute-1.amazonaws.com] run: date
DEBUG:cloudtools.aws.instance:hit error waiting for instance to come up
[ec2-50-19-57-178.compute-1.amazonaws.com] run: date
DEBUG:cloudtools.aws.instance:hit error waiting for instance to come up
[repeated until I interrupt it]
I was using the wrong config. I needed to use a config JSON from ami_configs:
(aws_manager)[buildduty@aws-manager1.srv.releng.scl3.mozilla.com build-cloud-tools-dhouse]$ python scripts/aws_create_ami.py --verbose --config centos-65-x86_64-hvm-base --region us-east-1 --secrets /builds/aws_manager/secrets/aws-secrets.json --key-name dhouse-test --ssh-key /home/buildduty/.ssh/aws-ssh-key ref-centos-6-x86_64-hvm-base
[...]
# also stuck in the "error waiting for instance to come up" loop, but packages were installed, etc.
Yeah, it creates an instance outside of the VPC, and aws-manager has issues accessing it. Can you try to repeat the same thing from "your laptop"? I vaguely remember running those outside of SCL3.
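A quick way to test that from the aws-manager host (hedged; hostname is the one in the log above) is to check whether the instance's SSH port is reachable at all:

$ nc -zv ec2-50-19-57-178.compute-1.amazonaws.com 22  # hangs/times out if aws-manager cannot reach the instance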
Attachment #8856968 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX