Closed Bug 1210629 Opened 9 years ago Closed 9 years ago

buildbot-master125 misbehaving

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: kmoir)

References

Details

Starts with:

<nagios-releng> Thu 13:27:29 PDT [4650] buildbot-master125.bb.releng.usw2.mozilla.com (10.132.68.48) is DOWN :PING CRITICAL - Packet loss = 100%

It's hard to figure out exactly what happened here because ntp bounced the time around, but it looks like a spontaneous reboot. It's disabled in slavealloc while we figure out what happened.
On the buildbot side we have these odd things:

# normal operation then start with a time jump and master start
2015-10-01 08:51:56-0700 [-] Claimed buildrequestids: [82728082L]
2015-10-01 08
2015-10-01 15:59:39-0700 [-] Log opened.
2015-10-01 15:59:39-0700 [-] twistd 10.1.0 (/builds/buildbot/tests1-linux64/bin/python 2.7.3) starting up.
2015-10-01 15:59:39-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
....
# another time discontinuity
2015-10-01 15:59:53-0700 [-] adding new builder Android 4.3 armv7 API 11+ ash opt test plain-reftest-6 for category ash
2015-10-01 15:59:53-0700 [-] trying to load status pickle from /builds/buildbot/tests1-linux64/master/ash_ubuntu64_vm_ar
2015-10-01 08:59:54-0700 [-] added builder Android 4.3 armv7 API 11+ ash opt test plain-reftest-6 in category ash
2015-10-01 08:59:54-0700 [-] adding new builder Ubuntu VM 12.04 x64 mozilla-release opt test jsreftest for category mozi
....
# startup completes
2015-10-01 09:00:18-0700 [-] BuildMaster listening on port tcp:9201
2015-10-01 09:00:18-0700 [-] configuration update started
2015-10-01 09:00:36-0700 [-] adding 1 new changesources, removing 0
2015-10-01 09:00:36-0700 [-] twisted.spread.pb.PBServerFactory starting on 9201
2015-10-01 09:00:36-0700 [-] Starting factory <twisted.spread.pb.PBServerFactory instance at 0x2ceae60>
2015-10-01 09:00:36-0700 [-] configuration update complete
2015-10-01 09:00:37-0700 [Broker,0,10.132.47.214] Got slaveinfo from 'tst-emulator64-spot-321'
2015-10-01 09:00:37-0700 [Broker,0,10.132.47.214] bot attached

but no jobs run, until the log ends at 2015-10-01 13:20:28-0700.
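The backward jumps above are easier to spot mechanically than by eye. As a rough illustration (not something that was run as part of this triage; the log path is taken from the excerpt above and the timestamp format is assumed from it), a short script could flag lines whose timestamp steps backwards relative to the previous line:

# Sketch: flag backward time steps in a buildbot twistd.log.
# Assumes timestamps look like "2015-10-01 08:51:56-0700 [-] ..." as in the
# excerpt above; lines that don't parse (e.g. the truncated one) are skipped.
import re
from datetime import datetime

TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[+-]\d{4})")

def find_time_steps(path, threshold_secs=60):
    prev = None
    with open(path, errors="replace") as fh:
        for lineno, line in enumerate(fh, 1):
            m = TS_RE.match(line)
            if not m:
                continue
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S%z")
            if prev is not None and (ts - prev).total_seconds() < -threshold_secs:
                print("line %d: clock stepped back by %s" % (lineno, prev - ts))
            prev = ts

if __name__ == "__main__":
    # Path from the excerpt above; adjust for the master being inspected.
    find_time_steps("/builds/buildbot/tests1-linux64/master/twistd.log")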
Over in syslog we have:

# looks like a reboot at 8:51, and coming up with the wrong time
Oct 1 08:46:55 buildbot-master125.bb.releng.usw2.mozilla.com collectd[1158]: ethstat plugin: No stats available for eth0
Oct 1 08:51:55 buildbot-master125.bb.releng.usw2.mozilla.com collectd[1158]: ethstat plugin: No stats available for eth0
Oct 1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com rsyslogd: [origin software="rsyslogd" swVersion="7.6.7" x-pid="837" x-info="http://www.rsyslog.com"] start
# later in the bootup we have possibly a missing disk check, unexpected sshd cert generation, and a -25199s time step
Oct 1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: Initialising Xen virtual ethernet driver.
Oct 1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
Oct 1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: Adding 31449596k swap on /dev/xvdb. Priority:-1 extents:1 across:31449596k SS
Oct 1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: NET: Registered protocol family 10
Oct 1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: lo: Disabled Privacy Extensions
Oct 1 15:59:51 buildbot-master125.bb.releng.usw2.mozilla.com [CLOUDINIT] util.py[WARNING]: Failed to run bootcmd module bootcmd
Oct 1 15:59:51 buildbot-master125.bb.releng.usw2.mozilla.com [CLOUDINIT] util.py[WARNING]: Running bootcmd (<module 'cloudinit.config.cc_bootcmd' from '/usr/lib/python2.6/site-packages/cloudinit/config/cc_bootcmd.pyc'>) failed
Oct 1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: Set /proc/self/oom_score_adj from 0 to -1000
Oct 1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: Server listening on 0.0.0.0 port 22.
Oct 1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: Server listening on :: port 22.
Oct 1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: Generating 1024 bit RSA key.
Oct 1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: RSA key generation complete.
Oct 1 08:59:54 buildbot-master125.bb.releng.usw2.mozilla.com ntpdate[1071]: step time server 10.26.75.40 offset -25199.406582 sec
Oct 1 08:59:54 buildbot-master125.bb.releng.usw2.mozilla.com ntpd[1078]: ntpd 4.2.6p5@1.2349-o Sat Nov 23 18:21:48 UTC 2013 (1)
# some yum changes, expected ?
Oct 1 11:25:25 buildbot-master125.bb.releng.usw2.mozilla.com puppet-agent[7295]: (/Stage[packagesetup]/Packages::Setup/Packages::Yumrepo[auditd]/File[/etc/yum.repos.d/auditd.mirrors]/content) content changed '{md5}7553e54b269d41bd471e123960ba3552' to '{md5}868268ad807533b8fa04b6cadcba865f'
Oct 1 11:25:25 buildbot-master125.bb.releng.usw2.mozilla.com puppet-agent[7295]: (/Stage[packagesetup]/Packages::Setup/Packages::Yumrepo[collectd]/File[/etc/yum.repos.d/collectd.mirrors]/content) content changed '{md5}1997e1fe0a4866179226df2df4267f3d' to '{md5}c3e10da0deb0f8aa08018a4b421b3eb0'
[and several more]
Oct 1 11:25:51 buildbot-master125.bb.releng.usw2.mozilla.com puppet-agent[7295]: (/Stage[main]/Main/Node[buildbot-master125.bb.releng.usw2.mozilla.com]/Buildmaster::Buildbot_master::Mozilla[bm125-tests1-linux64]/File[/builds/buildbot/tests1-linux64/master/BuildSlaves.py]/content) content changed '{md5}c0af2b3664af27cb7e5b0b6dc1a19281' to '{md5}853ba9241e9a35b6dd82e9340e825c0e'
# another spontaneous reboot ? NB time discontinuity
Oct 1 13:14:55 buildbot-master125.bb.releng.usw2.mozilla.com collectd[1187]: ethstat plugin: No stats available for eth0
Oct 1 13:19:55 buildbot-master125.bb.releng.usw2.mozilla.com collectd[1187]: ethstat plugin: No stats available for eth0
Oct 1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com rsyslogd: [origin software="rsyslogd" swVersion="7.6.7" x-pid="839" x-info="http://www.rsyslog.com"] start
...
# some disk repairs
Oct 1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): INFO: recovery required on readonly filesystem
Oct 1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): write access will be enabled during recovery
Oct 1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): orphan cleanup on readonly fs
Oct 1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): 2 orphan inodes deleted
Oct 1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): recovery complete
Oct 1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts:
...
# another time jump, are we cleaning up unused rsa host sig, so we regenerate every boot ?
Oct 1 20:28:54 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1047]: Generating 1024 bit RSA key.
Oct 1 20:28:54 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1047]: RSA key generation complete.
Oct 1 13:28:56 buildbot-master125.bb.releng.usw2.mozilla.com ntpdate[1061]: step time server 10.26.75.40 offset -25199.270760 sec
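One note on the size of those steps (a back-of-the-envelope check, not something stated in the logs themselves): an offset of about -25199 seconds is almost exactly 7 hours, which matches the UTC-to-PDT offset, so it looks like the instance clock comes up on UTC and ntpdate steps it back to local time on each boot.

# Quick arithmetic on the ntpdate step size quoted in syslog above.
offset_secs = 25199.406582        # from "offset -25199.406582 sec"
print(offset_secs / 3600.0)       # ~7.0 hours; PDT is UTC-7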
So this machine looks pretty sick. It may just be the underlying AWS host (although nothing in EC2 console > Events), and we could shut it down and hope it starts up elsewhere. Given the potential for disk corruption, maybe the best option is to terminate and recreate the instance.

Status: disabled in slavealloc, downtimed in nagios until mid-morning EDT, left turned on.
Blocks: 1205409
I'll terminate it and regenerate it tomorrow.
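(For illustration only; the exact tooling and commands used for the termination aren't recorded in this bug, and the Name-tag lookup below is an assumption.) A minimal boto3 sketch of the terminate step would look roughly like:

# Minimal sketch: find the instance by a (hypothetical) Name tag and
# terminate it so a fresh master can be created in its place.
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")   # usw2, per the hostname

resp = ec2.describe_instances(
    Filters=[{"Name": "tag:Name", "Values": ["buildbot-master125"]}]
)
instance_ids = [
    i["InstanceId"]
    for r in resp["Reservations"]
    for i in r["Instances"]
]

if instance_ids:
    ec2.terminate_instances(InstanceIds=instance_ids)
    print("terminated:", instance_ids)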
New master is up, re-enabled in slavealloc.
I don't know why, but it is saying it needs a reboot after puppet again (REBOOT REQUIRED: reboot_after_puppet). This message was not there before. I've disabled it in slavealloc again. It didn't take any jobs today.
I ran puppet manually and rebooted it, and now buildbot is up. The only thing puppet updated was /etc/motd, to say that it no longer needed to be rebooted. Will keep an eye on it; it doesn't show any jobs running yet, but machines are attached to it.
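As an aside on that flag, the "REBOOT REQUIRED: reboot_after_puppet" marker lives in /etc/motd per the comments above. A hedged sketch (hypothetical; no such check is referenced in this bug) of testing whether a host still carries the marker:

# Hypothetical check: exit non-zero while /etc/motd still carries the
# "REBOOT REQUIRED" marker mentioned in the comments above.
import sys

MARKER = "REBOOT REQUIRED"

def needs_reboot(motd_path="/etc/motd"):
    try:
        with open(motd_path) as fh:
            return MARKER in fh.read()
    except IOError:
        return False

if __name__ == "__main__":
    if needs_reboot():
        print("CRITICAL: host still flagged reboot_after_puppet")
        sys.exit(2)
    print("OK: no pending reboot flag")
    sys.exit(0)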
looks like it is running jobs and happy
Assignee: nobody → kmoir
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard