Closed
Bug 1135664
Opened 9 years ago
Closed 9 years ago
Some masters don't have swap enabled
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: jlund)
I suspect these fell between the cracks of the broken add_swap crontask and the cloud-init fix. The fix should be as simple as (a) fixing their cloud-init script and (b) setting up swap by hand (to avoid a reboot).
Reporter
Comment 1•9 years ago
Odd --

[root@buildbot-master115.bb.releng.usw2.mozilla.com ~]# umount /mnt
[root@buildbot-master115.bb.releng.usw2.mozilla.com ~]# mkswap /dev/xvdb
mkswap: /dev/xvdb: warning: don't erase bootbits sectors on whole disk. Use -f to force.
Setting up swapspace version 1, size = 4188668 KiB
no label, UUID=a69e1183-9ba4-486c-8348-c05a5387f6bc
[root@buildbot-master115.bb.releng.usw2.mozilla.com ~]# swapon /dev/xvdb
[root@buildbot-master115.bb.releng.usw2.mozilla.com ~]# swapon -s
Filename      Type       Size     Used  Priority
/dev/xvdb     partition  4188668  0     -1
/swap_file    file       4194300  0     -2

so that already had almost 4G of swap on /swap_file (not ideal, as it's an EBS volume), yet top was saying

Mem: 3729144k total, 191100k used, 3538044k free, 12912k buffers
Swap: 0k total, 0k used, 0k free, 59968k cached
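One way to cross-check the numbers above is to total the `swapon -s` entries and compare the result with the "Swap:" line from top. This is a minimal Python 3 sketch, not something that was run on the masters. The function names are hypothetical, and it assumes swap file names contain no whitespace and that sizes are in KiB, as /proc/swaps reports them:

```python
def parse_swapon_summary(text):
    """Parse `swapon -s` (or /proc/swaps) output into a list of dicts."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Filename Type Size Used Priority" header.
        if len(fields) != 5 or fields[0] == "Filename":
            continue
        name, kind, size, used, priority = fields
        entries.append({
            "name": name,
            "type": kind,
            "size_kib": int(size),
            "used_kib": int(used),
            "priority": int(priority),
        })
    return entries


def total_swap_kib(text):
    """Sum the sizes of all active swap areas, in KiB."""
    return sum(entry["size_kib"] for entry in parse_swapon_summary(text))


# Output copied from the transcript above.
sample = """Filename Type Size Used Priority
/dev/xvdb partition 4188668 0 -1
/swap_file file 4194300 0 -2
"""

print(total_swap_kib(sample))
```

If the total here is non-zero while top still shows "Swap: 0k total", the discrepancy is in what top is reporting rather than in the swap setup itself.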
Reporter
Updated•9 years ago
Assignee: relops → dustin
Reporter
Comment 2•9 years ago
Maybe that was a fluke:

[root@buildbot-master53.bb.releng.usw2.mozilla.com ~]# swapon -s
Filename      Type       Size     Used  Priority
[root@buildbot-master53.bb.releng.usw2.mozilla.com ~]# umount /mnt
[root@buildbot-master53.bb.releng.usw2.mozilla.com ~]# mkswap /dev/xvdb
mkswap: /dev/xvdb: warning: don't erase bootbits sectors on whole disk. Use -f to force.
Setting up swapspace version 1, size = 4188668 KiB
no label, UUID=7f6d4c21-bf5a-482b-b7df-095417773be5
[root@buildbot-master53.bb.releng.usw2.mozilla.com ~]# swapon /dev/xvdb
[root@buildbot-master53.bb.releng.usw2.mozilla.com ~]# swapon -s
Filename      Type       Size     Used  Priority
/dev/xvdb     partition  4188668  0     -1
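The by-hand sequence used on master53 can be written down as a dry-run plan, which is handy when the same steps have to be repeated on a list of hosts. This is a sketch only: it prints the commands and does not execute anything or connect anywhere. The helper names are hypothetical, and the device and mount point defaults are taken from the transcript above:

```python
import shlex


def manual_swap_commands(device="/dev/xvdb", mountpoint="/mnt"):
    """Return the argv lists for the by-hand fix, in the order they were run."""
    return [
        ["swapon", "-s"],        # check the current state first
        ["umount", mountpoint],  # free the ephemeral disk
        ["mkswap", device],      # format it as swap
        ["swapon", device],      # enable it
        ["swapon", "-s"],        # confirm the new swap area is listed
    ]


def render_plan(host, commands):
    """Format the commands as root shell-prompt lines for one host."""
    return "\n".join(
        "[root@{} ~]# {}".format(host, " ".join(shlex.quote(arg) for arg in argv))
        for argv in commands
    )


plan = render_plan(
    "buildbot-master53.bb.releng.usw2.mozilla.com", manual_swap_commands()
)
print(plan)
```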
Reporter
Comment 3•9 years ago
Masters to fix:

buildbot-master04.bb.releng.usw2.mozilla.com
buildbot-master05.bb.releng.usw2.mozilla.com
buildbot-master06.bb.releng.usw2.mozilla.com
buildbot-master53.bb.releng.usw2.mozilla.com (above)
buildbot-master54.bb.releng.usw2.mozilla.com
buildbot-master66.bb.releng.usw2.mozilla.com
buildbot-master68.bb.releng.usw2.mozilla.com
buildbot-master72.bb.releng.usw2.mozilla.com
buildbot-master73.bb.releng.usw2.mozilla.com
buildbot-master74.bb.releng.usw2.mozilla.com
buildbot-master78.bb.releng.usw2.mozilla.com
buildbot-master79.bb.releng.usw2.mozilla.com
buildbot-master91.bb.releng.usw2.mozilla.com
buildbot-master115.bb.releng.usw2.mozilla.com (above)
buildbot-master116.bb.releng.usw2.mozilla.com
buildbot-master117.bb.releng.use1.mozilla.com
buildbot-master118.bb.releng.usw2.mozilla.com
buildbot-master120.bb.releng.use1.mozilla.com
buildbot-master121.bb.releng.use1.mozilla.com
buildbot-master122.bb.releng.usw2.mozilla.com
buildbot-master123.bb.releng.usw2.mozilla.com
Reporter
Comment 4•9 years ago
All now have swap enabled manually, and I verified that buildbot is running on all of them, too. Now to fix the cloud-init stuff in their instance configuration.
Reporter
Comment 5•9 years ago
I want the cloud-init snippet from bug 1130176 comment 10. However, the 'mounts' bit only runs on first instance startup, and edits fstab, so I'll need to do that by hand. The bootcmd will safely run on every boot, though.
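Since 'mounts' won't re-run on existing instances, the fstab change has to be made directly. A rough Python 3 sketch of that edit follows, as an illustration rather than the procedure actually used. The function name is hypothetical, the sample fstab lines are invented for the example, and the swap fields (none, swap, sw, 0, 0) are the ones the swap 'mounts' entry asks for:

```python
def swap_fstab_entry(fstab_text, device="/dev/xvdb"):
    """Return fstab text with the line for `device` replaced by a swap entry.

    If no line for the device exists, a swap entry is appended instead.
    Comment lines and all other entries are left untouched.
    """
    swap_line = f"{device}\tnone\tswap\tsw\t0\t0"
    out = []
    replaced = False
    for line in fstab_text.splitlines():
        fields = line.split()
        if fields and fields[0] == device:
            out.append(swap_line)
            replaced = True
        else:
            out.append(line)
    if not replaced:
        out.append(swap_line)
    return "\n".join(out) + "\n"


# Invented sample; a real /etc/fstab on the masters may look different.
before = (
    "/dev/xvda1 / ext4 defaults 1 1\n"
    "/dev/xvdb /mnt auto defaults,noexec 0 2\n"
)
print(swap_fstab_entry(before), end="")
```

Writing the result back to /etc/fstab is deliberately left out; reviewing the output before replacing the file is the safer order.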
Reporter
Comment 6•9 years ago
..but, of course, you can't change userData while an instance is running. I'm *not* going to try to graceful all the masters again, because that takes days. Here's the list of masters that need their user-data updated (those tagged "BAD"):

=== buildbot-master51 OK
=== buildbot-master121 BAD
=== buildbot-master52 OK
=== buildbot-master03 OK
=== buildbot-master70 OK
=== buildbot-master120 BAD
=== buildbot-master94 OK
=== buildbot-master113 OK
=== buildbot-master76 OK
=== buildbot-master67 OK
=== buildbot-master01 OK
=== buildbot-master117 BAD
=== buildbot-master02 OK
=== buildbot-master69 OK
=== buildbot-master77 OK
=== buildbot-master71 OK
=== buildbot-master114 OK
=== buildbot-master75 OK
=== buildbot-master118 BAD
=== buildbot-master78 BAD
=== buildbot-master68 BAD
=== buildbot-master115 BAD
=== buildbot-master79 BAD
=== buildbot-master54 BAD
=== buildbot-master123 BAD
=== buildbot-master74 BAD
=== buildbot-master66 BAD
=== buildbot-master53 BAD
=== buildbot-master116 BAD
=== buildbot-master05 BAD
=== buildbot-master06 BAD
=== buildbot-master91 BAD
=== buildbot-master04 BAD
=== buildbot-master73 BAD
=== buildbot-master122 BAD
=== buildbot-master72 BAD

I think the best we can do is to try to keep this in the back of our minds for the next time there's an all-out tree closure or a TCW, and just stop all of the bad masters uncleanly, update their user data, and start them back up again.
Here's a cheap-o script to do that:

import boto.ec2
import base64
import gzip

for rgn in 'us-east-1', 'us-west-2':
    conn = boto.ec2.connect_to_region(rgn)
    for res in conn.get_all_instances(filters={'tag:Name': 'buildbot-master*'}):
        name = res.instances[0].tags['Name']
        iid = res.instances[0].id
        print "===", name,
        # user data comes back base64-encoded
        userData = base64.b64decode(str(conn.get_instance_attribute(iid, 'userData')['userData']))
        # already has the swap bootcmd, nothing to do
        if 'swapon /dev/xvdb' in userData:
            print "OK"
            continue
        # swap the /mnt mount for a swap mount plus a bootcmd that runs on every boot
        newData = userData.replace('mounts:\n - [ ephemeral0, /mnt, auto, "defaults,noexec" ]\n', """\
mounts:
 - [ ephemeral0, none, swap, sw, 0, 0 ]
bootcmd:
 - mkswap /dev/xvdb
 - swapon /dev/xvdb
""")
        encoded = base64.b64encode(newData)
        conn.modify_instance_attribute(iid, 'userData', encoded)
        print "Fixed"
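The text substitution in that script can be tried out without touching EC2 by pulling it into a pure function. The following is a Python 3 sketch under a few assumptions: the function and constant names are hypothetical, the one-space YAML indentation mirrors the search string in the script, and the "UNCHANGED" status is an addition that the script above does not have (it prints "Fixed" even when the old mounts block is not found):

```python
OLD_MOUNTS = 'mounts:\n - [ ephemeral0, /mnt, auto, "defaults,noexec" ]\n'
NEW_MOUNTS = (
    "mounts:\n"
    " - [ ephemeral0, none, swap, sw, 0, 0 ]\n"
    "bootcmd:\n"
    " - mkswap /dev/xvdb\n"
    " - swapon /dev/xvdb\n"
)


def rewrite_user_data(user_data):
    """Return (new_user_data, status) for one instance's decoded user data.

    status is "OK" if the swap bootcmd is already present, "Fixed" if the
    old /mnt mount was replaced, and "UNCHANGED" if neither pattern matched.
    """
    if "swapon /dev/xvdb" in user_data:
        return user_data, "OK"
    if OLD_MOUNTS not in user_data:
        return user_data, "UNCHANGED"
    return user_data.replace(OLD_MOUNTS, NEW_MOUNTS), "Fixed"


# Invented sample user data; real instance configs will carry more keys.
sample = "#cloud-config\n" + OLD_MOUNTS + "hostname: example\n"
fixed, status = rewrite_user_data(sample)
print(status)
print(fixed, end="")
```

Running the function a second time on its own output returns "OK", so the rewrite is safe to repeat across a partially fixed pool.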
Reporter
Updated•9 years ago
Assignee: dustin → relops
Comment 7•9 years ago
Hal, See c#6 for work relops or releng should perform during next TCW.
Flags: cab-review?
Reporter
Comment 8•9 years ago
dev-master2, too.
Comment 9•9 years ago
We should do bug 1136527 at the same time.
Comment 10•9 years ago
Approved for March 15th TCW - reviewed by CAB 2/25
Flags: cab-review? → cab-review+
Updated•9 years ago
Assignee: relops → jlund
Assignee
Comment 11•9 years ago
I fixed the ones marked as BAD that coincided with the masters I upgraded in https://bugzilla.mozilla.org/show_bug.cgi?id=1136527#c6

so, done:

bm53 (complete)
buildbot-master54 (complete)
buildbot-master68 (complete)
buildbot-master115 (complete)

still todo:

buildbot-master117 BAD
buildbot-master120 BAD
buildbot-master121 BAD
buildbot-master116 BAD
buildbot-master118 BAD
buildbot-master122 BAD
buildbot-master123 BAD
buildbot-master04 BAD
buildbot-master05 BAD
buildbot-master06 BAD
buildbot-master66 BAD
buildbot-master72 BAD
buildbot-master73 BAD
buildbot-master74 BAD
buildbot-master78 BAD
buildbot-master79 BAD
buildbot-master91 BAD
Assignee
Comment 12•9 years ago
fixed today:

buildbot-master117
buildbot-master120
buildbot-master121
buildbot-master116
buildbot-master118
buildbot-master122
buildbot-master123

still todo:

buildbot-master04 BAD
buildbot-master05 BAD
buildbot-master06 BAD
buildbot-master66 BAD
buildbot-master72 BAD
buildbot-master73 BAD
buildbot-master74 BAD
buildbot-master78 BAD
buildbot-master79 BAD
buildbot-master91 BAD

I will likely be leaving the remaining for the tree closure as I am done with Bug 1136527
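Keeping the "still todo" list in sync by hand is error-prone; it can be derived from the "=== name STATUS" report and the set of masters already fixed. This is a small Python 3 sketch with hypothetical function names. It assumes the report format shown in comment 6, and the sample is a shortened excerpt rather than the full list:

```python
import re

# Matches entries like "=== buildbot-master121 BAD" in the report.
REPORT_RE = re.compile(r"===\s+(\S+)\s+(OK|BAD)\b")


def parse_report(text):
    """Map each master name in the report to its status ("OK" or "BAD")."""
    return {name: status for name, status in REPORT_RE.findall(text)}


def still_todo(report_text, fixed):
    """Return the BAD masters that are not yet in `fixed`, sorted by name."""
    bad = {name for name, status in parse_report(report_text).items() if status == "BAD"}
    return sorted(bad - set(fixed))


# Shortened excerpt of the report, for illustration only.
report = """\
=== buildbot-master51 OK
=== buildbot-master121 BAD
=== buildbot-master68 BAD
=== buildbot-master04 BAD
=== buildbot-master91 BAD
"""

fixed_so_far = ["buildbot-master68", "buildbot-master121"]
print(still_todo(report, fixed_so_far))
```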
Assignee
Comment 13•9 years ago
> buildbot-master04 BAD
> buildbot-master05 BAD
> buildbot-master06 BAD
> buildbot-master66 BAD
> buildbot-master72 BAD
> buildbot-master73 BAD
> buildbot-master74 BAD
> buildbot-master78 BAD
> buildbot-master79 BAD
> buildbot-master91 BAD
the remaining masters have been completed and brought back online. master66 is also back and running bumper. closing for now.
ni: myself to check most recent jobs on masters in a bit
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(jlund)
Resolution: --- → FIXED
Assignee
Comment 14•9 years ago
73, 78, and 79 haven't taken a job recently but with pending 0 for their master type, I can't see anything currently wrong.
Flags: needinfo?(jlund)
Updated•9 years ago
Change Request: --- → approved
Flags: cab-review+