Closed Bug 1135664 Opened 9 years ago Closed 9 years ago

Some masters don't have swap enabled

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: jlund)


I suspect these fell through the cracks between the broken add_swap crontask and the cloud-init fix.  The fix should be as simple as (a) fixing their cloud-init script and (b) setting up swap by hand (to avoid a reboot).
Odd --

[root@buildbot-master115.bb.releng.usw2.mozilla.com ~]# umount /mnt
[root@buildbot-master115.bb.releng.usw2.mozilla.com ~]# mkswap /dev/xvdb
mkswap: /dev/xvdb: warning: don't erase bootbits sectors
        on whole disk. Use -f to force.
Setting up swapspace version 1, size = 4188668 KiB
no label, UUID=a69e1183-9ba4-486c-8348-c05a5387f6bc
[root@buildbot-master115.bb.releng.usw2.mozilla.com ~]# swapon /dev/xvdb
[root@buildbot-master115.bb.releng.usw2.mozilla.com ~]# swapon -s
Filename                                Type            Size    Used    Priority
/dev/xvdb                               partition       4188668 0       -1
/swap_file                              file            4194300 0       -2

So that host already had almost 4G of swap on /swap_file (not ideal, since that file lives on an EBS volume), yet top was saying:

Mem:   3729144k total,   191100k used,  3538044k free,    12912k buffers
Swap:        0k total,        0k used,        0k free,    59968k cached
Assignee: relops → dustin
Maybe that was a fluke:

[root@buildbot-master53.bb.releng.usw2.mozilla.com ~]# swapon -s
Filename                                Type            Size    Used    Priority
[root@buildbot-master53.bb.releng.usw2.mozilla.com ~]# umount /mnt
[root@buildbot-master53.bb.releng.usw2.mozilla.com ~]# mkswap /dev/xvdb
mkswap: /dev/xvdb: warning: don't erase bootbits sectors
        on whole disk. Use -f to force.
Setting up swapspace version 1, size = 4188668 KiB
no label, UUID=7f6d4c21-bf5a-482b-b7df-095417773be5
[root@buildbot-master53.bb.releng.usw2.mozilla.com ~]# swapon /dev/xvdb
[root@buildbot-master53.bb.releng.usw2.mozilla.com ~]# swapon -s
Filename                                Type            Size    Used    Priority
/dev/xvdb                               partition       4188668 0       -1
Masters to fix:

buildbot-master04.bb.releng.usw2.mozilla.com
buildbot-master05.bb.releng.usw2.mozilla.com
buildbot-master06.bb.releng.usw2.mozilla.com
buildbot-master53.bb.releng.usw2.mozilla.com (above)
buildbot-master54.bb.releng.usw2.mozilla.com
buildbot-master66.bb.releng.usw2.mozilla.com
buildbot-master68.bb.releng.usw2.mozilla.com
buildbot-master72.bb.releng.usw2.mozilla.com
buildbot-master73.bb.releng.usw2.mozilla.com
buildbot-master74.bb.releng.usw2.mozilla.com
buildbot-master78.bb.releng.usw2.mozilla.com
buildbot-master79.bb.releng.usw2.mozilla.com
buildbot-master91.bb.releng.usw2.mozilla.com
buildbot-master115.bb.releng.usw2.mozilla.com (above)
buildbot-master116.bb.releng.usw2.mozilla.com 
buildbot-master117.bb.releng.use1.mozilla.com 
buildbot-master118.bb.releng.usw2.mozilla.com 
buildbot-master120.bb.releng.use1.mozilla.com 
buildbot-master121.bb.releng.use1.mozilla.com 
buildbot-master122.bb.releng.usw2.mozilla.com 
buildbot-master123.bb.releng.usw2.mozilla.com
All now have swap enabled manually, and I verified that buildbot is running on all of them, too.

Now to fix the cloud-init stuff in their instance configuration.
I want the cloud-init snippet from bug 1130176 comment 10.  However, the 'mounts' bit only runs on first instance startup and edits fstab, so I'll need to make that change by hand.  The bootcmd will safely run on every boot, though.
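
For reference, here is roughly what that by-hand fstab edit amounts to -- a sketch only, not the exact steps taken.  It assumes the ephemeral device is /dev/xvdb (as in the transcripts above) and that it is currently listed as a /mnt mount; the helper name is illustrative:

# Sketch of the by-hand fstab change that the new 'mounts' stanza would
# otherwise only make on a first boot.  Assumes /dev/xvdb is the ephemeral
# device and is currently mounted on /mnt.  Run as root.
FSTAB = '/etc/fstab'
SWAP_ENTRY = '/dev/xvdb\tnone\tswap\tsw\t0\t0\n'

with open(FSTAB) as f:
    lines = f.readlines()

def is_old_mnt_entry(line):
    fields = line.split()
    return len(fields) >= 2 and fields[0] == '/dev/xvdb' and fields[1] == '/mnt'

# Drop the old ephemeral /mnt entry, then add the swap entry if it's missing.
lines = [l for l in lines if not is_old_mnt_entry(l)]
if not any(l.split()[:3] == ['/dev/xvdb', 'none', 'swap'] for l in lines):
    lines.append(SWAP_ENTRY)

with open(FSTAB, 'w') as f:
    f.writelines(lines)
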
..but, of course, you can't change userData while an instance is running.  I'm *not* going to try to graceful all the masters again (a graceful shutdown waits for running jobs to finish), because that takes days.  Here's the list of masters that need their user-data updated (those tagged "BAD"):

=== buildbot-master51 OK
=== buildbot-master121 BAD
=== buildbot-master52 OK
=== buildbot-master03 OK
=== buildbot-master70 OK
=== buildbot-master120 BAD
=== buildbot-master94 OK
=== buildbot-master113 OK
=== buildbot-master76 OK
=== buildbot-master67 OK
=== buildbot-master01 OK
=== buildbot-master117 BAD
=== buildbot-master02 OK
=== buildbot-master69 OK
=== buildbot-master77 OK
=== buildbot-master71 OK
=== buildbot-master114 OK
=== buildbot-master75 OK
=== buildbot-master118 BAD
=== buildbot-master78 BAD
=== buildbot-master68 BAD
=== buildbot-master115 BAD
=== buildbot-master79 BAD
=== buildbot-master54 BAD
=== buildbot-master123 BAD
=== buildbot-master74 BAD
=== buildbot-master66 BAD
=== buildbot-master53 BAD
=== buildbot-master116 BAD
=== buildbot-master05 BAD
=== buildbot-master06 BAD
=== buildbot-master91 BAD
=== buildbot-master04 BAD
=== buildbot-master73 BAD
=== buildbot-master122 BAD
=== buildbot-master72 BAD

I think the best we can do is to keep this in the back of our minds for the next time there's an all-out tree closure or a TCW (tree closing window), and just stop all of the bad masters uncleanly, update their user data, and start them back up again.  Here's a cheap-o script to do the userData part (the stop/start around it still happens by hand; see the sketch after the script):

import base64

import boto.ec2

for rgn in 'us-east-1', 'us-west-2':
    conn = boto.ec2.connect_to_region(rgn)
    for res in conn.get_all_instances(filters={'tag:Name': 'buildbot-master*'}):
        name = res.instances[0].tags['Name']
        iid = res.instances[0].id
        print "===", name,
        # userData comes back base64-encoded from the EC2 API.
        userData = base64.b64decode(str(conn.get_instance_attribute(iid, 'userData')['userData']))
        if 'swapon /dev/xvdb' in userData:
            # Already has the swap bootcmd; nothing to do.
            print "OK"
            continue
        # Replace the plain /mnt mount with a swap mount, plus a bootcmd
        # that (re)creates and enables swap on every boot.
        newData = userData.replace('mounts:\n - [ ephemeral0, /mnt, auto, "defaults,noexec" ]\n', """\
mounts:
 - [ ephemeral0, none, swap, sw, 0, 0 ]
bootcmd:
 - mkswap /dev/xvdb
 - swapon /dev/xvdb
""")
        # This call only succeeds while the instance is stopped, since
        # userData can't be changed on a running instance.
        encoded = base64.b64encode(newData)
        conn.modify_instance_attribute(iid, 'userData', encoded)
        print "Fixed"
Assignee: dustin → relops
Hal, 

See c#6 for the work relops or releng should perform during the next TCW.
Flags: cab-review?
dev-master2, too.
We should do bug 1136527 at the same time.
Approved for March 15th TCW - reviewed by CAB 2/25
Flags: cab-review? → cab-review+
Assignee: relops → jlund
I fixed the ones marked as BAD that coincided with the masters I upgraded in https://bugzilla.mozilla.org/show_bug.cgi?id=1136527#c6

so,

done:
buildbot-master53 (complete)
buildbot-master54 (complete)
buildbot-master68 (complete)
buildbot-master115 (complete)

still todo:
buildbot-master117 BAD
buildbot-master120 BAD
buildbot-master121 BAD
buildbot-master116 BAD
buildbot-master118 BAD
buildbot-master122 BAD
buildbot-master123 BAD
buildbot-master04 BAD
buildbot-master05 BAD
buildbot-master06 BAD
buildbot-master66 BAD
buildbot-master72 BAD
buildbot-master73 BAD
buildbot-master74 BAD
buildbot-master78 BAD
buildbot-master79 BAD
buildbot-master91 BAD
fixed today:
buildbot-master117
buildbot-master120
buildbot-master121
buildbot-master116
buildbot-master118
buildbot-master122
buildbot-master123

still todo:
buildbot-master04 BAD
buildbot-master05 BAD
buildbot-master06 BAD
buildbot-master66 BAD
buildbot-master72 BAD
buildbot-master73 BAD
buildbot-master74 BAD
buildbot-master78 BAD
buildbot-master79 BAD
buildbot-master91 BAD

I will likely leave the remaining masters for the tree closure, as I am done with Bug 1136527.

> buildbot-master04 BAD
> buildbot-master05 BAD
> buildbot-master06 BAD
> buildbot-master66 BAD
> buildbot-master72 BAD
> buildbot-master73 BAD
> buildbot-master74 BAD
> buildbot-master78 BAD
> buildbot-master79 BAD
> buildbot-master91 BAD

The remaining masters have been completed and brought back online. master66 is also back and running bumper. Closing for now.

ni: myself to check most recent jobs on masters in a bit
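
Separately from checking recent jobs, the userData side can be re-verified read-only with the same boto calls as the script in comment 6; a sketch (nothing here modifies the instances):

import base64

import boto.ec2

# Print OK/BAD for each buildbot master depending on whether its userData
# now contains the swap bootcmd.
for rgn in 'us-east-1', 'us-west-2':
    conn = boto.ec2.connect_to_region(rgn)
    for res in conn.get_all_instances(filters={'tag:Name': 'buildbot-master*'}):
        inst = res.instances[0]
        userData = base64.b64decode(str(conn.get_instance_attribute(inst.id, 'userData')['userData']))
        print "===", inst.tags['Name'], "OK" if 'swapon /dev/xvdb' in userData else "BAD"
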
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(jlund)
Resolution: --- → FIXED
73, 78, and 79 haven't taken a job recently, but with pending at 0 for their master type, I can't see anything currently wrong.
Flags: needinfo?(jlund)
Change Request: --- → approved
Flags: cab-review+