Closed Bug 1210629 Opened 9 years ago Closed 9 years ago

buildbot-master125 misbehaving

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: kmoir)

References

Details

Starts with
<nagios-releng> Thu 13:27:29 PDT [4650] buildbot-master125.bb.releng.usw2.mozilla.com (10.132.68.48) is DOWN :PING CRITICAL - Packet loss = 100%

It's kind of hard to try to figure out what happened here because ntp bounced the time around, but it looks like a spontaneous reboot.

It's disabled in slavealloc while we figure out what happened.
On the buildbot side we have these odd things:

# normal operation then start with a time jump and master start
2015-10-01 08:51:56-0700 [-] Claimed buildrequestids: [82728082L]
2015-10-01 082015-10-01 15:59:39-0700 [-] Log opened.
2015-10-01 15:59:39-0700 [-] twistd 10.1.0 (/builds/buildbot/tests1-linux64/bin/python 2.7.3) starting up.
2015-10-01 15:59:39-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
....
# another time discontinuity
2015-10-01 15:59:53-0700 [-] adding new builder Android 4.3 armv7 API 11+ ash opt test plain-reftest-6 for category ash
2015-10-01 15:59:53-0700 [-] trying to load status pickle from /builds/buildbot/tests1-linux64/master/ash_ubuntu64_vm_ar
2015-10-01 08:59:54-0700 [-] added builder Android 4.3 armv7 API 11+ ash opt test plain-reftest-6 in category ash
2015-10-01 08:59:54-0700 [-] adding new builder Ubuntu VM 12.04 x64 mozilla-release opt test jsreftest for category mozi
....
# startup completes
2015-10-01 09:00:18-0700 [-] BuildMaster listening on port tcp:9201
2015-10-01 09:00:18-0700 [-] configuration update started
2015-10-01 09:00:36-0700 [-] adding 1 new changesources, removing 0
2015-10-01 09:00:36-0700 [-] twisted.spread.pb.PBServerFactory starting on 9201
2015-10-01 09:00:36-0700 [-] Starting factory <twisted.spread.pb.PBServerFactory instance at 0x2ceae60>
2015-10-01 09:00:36-0700 [-] configuration update complete
2015-10-01 09:00:37-0700 [Broker,0,10.132.47.214] Got slaveinfo from 'tst-emulator64-spot-321'
2015-10-01 09:00:37-0700 [Broker,0,10.132.47.214] bot attached

but no jobs run, until the log ends at 2015-10-01 13:20:28-0700.
Over in syslog we have:

# looks like a reboot at 8:51, and coming up with the wrong time
Oct  1 08:46:55 buildbot-master125.bb.releng.usw2.mozilla.com collectd[1158]: ethstat plugin: No stats available for eth0
Oct  1 08:51:55 buildbot-master125.bb.releng.usw2.mozilla.com collectd[1158]: ethstat plugin: No stats available for eth0
Oct  1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com rsyslogd: [origin software="rsyslogd" swVersion="7.6.7" x-pid="837" x-info="http://www.rsyslog.com"] start

# later in the bootup we have possibly a missing disk check, unexpected sshd cert generation, and a -25199s time step
Oct  1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: Initialising Xen virtual ethernet driver.
Oct  1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
Oct  1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: Adding 31449596k swap on /dev/xvdb.  Priority:-1 extents:1 across:31449596k
SS
Oct  1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: NET: Registered protocol family 10
Oct  1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: lo: Disabled Privacy Extensions
Oct  1 15:59:51 buildbot-master125.bb.releng.usw2.mozilla.com [CLOUDINIT] util.py[WARNING]: Failed to run bootcmd module bootcmd
Oct  1 15:59:51 buildbot-master125.bb.releng.usw2.mozilla.com [CLOUDINIT] util.py[WARNING]: Running bootcmd (<module 'cloudinit.config.cc_bootcmd'
 from '/usr/lib/python2.6/site-packages/cloudinit/config/cc_bootcmd.pyc'>) failed
Oct  1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: Set /proc/self/oom_score_adj from 0 to -1000
Oct  1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: Server listening on 0.0.0.0 port 22.
Oct  1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: Server listening on :: port 22.
Oct  1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: Generating 1024 bit RSA key.
Oct  1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: RSA key generation complete.
Oct  1 08:59:54 buildbot-master125.bb.releng.usw2.mozilla.com ntpdate[1071]: step time server 10.26.75.40 offset -25199.406582 sec
Oct  1 08:59:54 buildbot-master125.bb.releng.usw2.mozilla.com ntpd[1078]: ntpd 4.2.6p5@1.2349-o Sat Nov 23 18:21:48 UTC 2013 (1)

# some yum changes, expected ?
Oct  1 11:25:25 buildbot-master125.bb.releng.usw2.mozilla.com puppet-agent[7295]: (/Stage[packagesetup]/Packages::Setup/Packages::Yumrepo[auditd]/File[/etc/yum.repos.d/auditd.mirrors]/content) content changed '{md5}7553e54b269d41bd471e123960ba3552' to '{md5}868268ad807533b8fa04b6cadcba865f'
Oct  1 11:25:25 buildbot-master125.bb.releng.usw2.mozilla.com puppet-agent[7295]: (/Stage[packagesetup]/Packages::Setup/Packages::Yumrepo[collectd]/File[/etc/yum.repos.d/collectd.mirrors]/content) content changed '{md5}1997e1fe0a4866179226df2df4267f3d' to '{md5}c3e10da0deb0f8aa08018a4b421b3eb0'
[and several more]
Oct  1 11:25:51 buildbot-master125.bb.releng.usw2.mozilla.com puppet-agent[7295]: (/Stage[main]/Main/Node[buildbot-master125.bb.releng.usw2.mozilla.com]/Buildmaster::Buildbot_master::Mozilla[bm125-tests1-linux64]/File[/builds/buildbot/tests1-linux64/master/BuildSlaves.py]/content) content changed '{md5}c0af2b3664af27cb7e5b0b6dc1a19281' to '{md5}853ba9241e9a35b6dd82e9340e825c0e'

# another spontaneous reboot ? NB time discontinuity 
Oct  1 13:14:55 buildbot-master125.bb.releng.usw2.mozilla.com collectd[1187]: ethstat plugin: No stats available for eth0
Oct  1 13:19:55 buildbot-master125.bb.releng.usw2.mozilla.com collectd[1187]: ethstat plugin: No stats available for eth0
Oct  1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com rsyslogd: [origin software="rsyslogd" swVersion="7.6.7" x-pid="839" x-info="http://www.rsyslog.com"] start
...
# some disk repairs
Oct  1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): INFO: recovery required on readonly filesystem
Oct  1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): write access will be enabled during recovery
Oct  1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): orphan cleanup on readonly fs
Oct  1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): 2 orphan inodes deleted
Oct  1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): recovery complete
Oct  1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: ...

# another time jump, are we cleaning up unused rsa host sig, so we regenerate every boot ?
Oct  1 20:28:54 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1047]: Generating 1024 bit RSA key.
Oct  1 20:28:54 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1047]: RSA key generation complete.
Oct  1 13:28:56 buildbot-master125.bb.releng.usw2.mozilla.com ntpdate[1061]: step time server 10.26.75.40 offset -25199.270760 sec
So this machine looks pretty sick. It may just be the underlying AWS host (although nothing in EC2 console > Events), and we could shut it down and hope it starts up elsewhere. Given the potential for disk corruption maybe the best option is to terminate and recreate the instance.

Status: disabled in slavealloc, downtimed in nagios until mid-morning EDT, left turned on.
Blocks: 1205409
I'll terminate it and regenerate it tomorrow.
new master is up, renabled in slavealloc
I don't know why but is saying it needs a reboot after puppet again. (REBOOT REQUIRED: reboot_after_puppet) This messages was not there before.  I've disabled in in slavealloc again.  It didn't take any jobs today.
I ran puppet manually and rebooted it and now buildbot it up.  The only thing puppet updated was the /etc/motd to say that it didn't need to be rebooted.  Will keep an eye on it, doesn't show any jobs running yet but machines are attached to it.
looks like it is running jobs and happy
Assignee: nobody → kmoir
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.