Closed
Bug 1210629
Opened 9 years ago
Closed 9 years ago
buildbot-master125 misbehaving
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Infrastructure & Operations Graveyard
CIDuty
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: kmoir)
References
Details
Starts with
<nagios-releng> Thu 13:27:29 PDT [4650] buildbot-master125.bb.releng.usw2.mozilla.com (10.132.68.48) is DOWN :PING CRITICAL - Packet loss = 100%
It's kind of hard to try to figure out what happened here because ntp bounced the time around, but it looks like a spontaneous reboot.
It's disabled in slavealloc while we figure out what happened.
Reporter | ||
Comment 1•9 years ago
|
||
On the buildbot side we have these odd things:
# normal operation then start with a time jump and master start
2015-10-01 08:51:56-0700 [-] Claimed buildrequestids: [82728082L]
2015-10-01 082015-10-01 15:59:39-0700 [-] Log opened.
2015-10-01 15:59:39-0700 [-] twistd 10.1.0 (/builds/buildbot/tests1-linux64/bin/python 2.7.3) starting up.
2015-10-01 15:59:39-0700 [-] reactor class: twisted.internet.selectreactor.SelectReactor.
....
# another time discontinuity
2015-10-01 15:59:53-0700 [-] adding new builder Android 4.3 armv7 API 11+ ash opt test plain-reftest-6 for category ash
2015-10-01 15:59:53-0700 [-] trying to load status pickle from /builds/buildbot/tests1-linux64/master/ash_ubuntu64_vm_ar
2015-10-01 08:59:54-0700 [-] added builder Android 4.3 armv7 API 11+ ash opt test plain-reftest-6 in category ash
2015-10-01 08:59:54-0700 [-] adding new builder Ubuntu VM 12.04 x64 mozilla-release opt test jsreftest for category mozi
....
# startup completes
2015-10-01 09:00:18-0700 [-] BuildMaster listening on port tcp:9201
2015-10-01 09:00:18-0700 [-] configuration update started
2015-10-01 09:00:36-0700 [-] adding 1 new changesources, removing 0
2015-10-01 09:00:36-0700 [-] twisted.spread.pb.PBServerFactory starting on 9201
2015-10-01 09:00:36-0700 [-] Starting factory <twisted.spread.pb.PBServerFactory instance at 0x2ceae60>
2015-10-01 09:00:36-0700 [-] configuration update complete
2015-10-01 09:00:37-0700 [Broker,0,10.132.47.214] Got slaveinfo from 'tst-emulator64-spot-321'
2015-10-01 09:00:37-0700 [Broker,0,10.132.47.214] bot attached
but no jobs run, until the log ends at 2015-10-01 13:20:28-0700.
Reporter | ||
Comment 2•9 years ago
|
||
Over in syslog we have:
# looks like a reboot at 8:51, and coming up with the wrong time
Oct 1 08:46:55 buildbot-master125.bb.releng.usw2.mozilla.com collectd[1158]: ethstat plugin: No stats available for eth0
Oct 1 08:51:55 buildbot-master125.bb.releng.usw2.mozilla.com collectd[1158]: ethstat plugin: No stats available for eth0
Oct 1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com rsyslogd: [origin software="rsyslogd" swVersion="7.6.7" x-pid="837" x-info="http://www.rsyslog.com"] start
# later in the bootup we have possibly a missing disk check, unexpected sshd cert generation, and a -25199s time step
Oct 1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: Initialising Xen virtual ethernet driver.
Oct 1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT2-fs warning: mounting unchecked fs, running e2fsck is recommended
Oct 1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: Adding 31449596k swap on /dev/xvdb. Priority:-1 extents:1 across:31449596k
SS
Oct 1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: NET: Registered protocol family 10
Oct 1 15:59:35 buildbot-master125.bb.releng.usw2.mozilla.com kernel: lo: Disabled Privacy Extensions
Oct 1 15:59:51 buildbot-master125.bb.releng.usw2.mozilla.com [CLOUDINIT] util.py[WARNING]: Failed to run bootcmd module bootcmd
Oct 1 15:59:51 buildbot-master125.bb.releng.usw2.mozilla.com [CLOUDINIT] util.py[WARNING]: Running bootcmd (<module 'cloudinit.config.cc_bootcmd'
from '/usr/lib/python2.6/site-packages/cloudinit/config/cc_bootcmd.pyc'>) failed
Oct 1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: Set /proc/self/oom_score_adj from 0 to -1000
Oct 1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: Server listening on 0.0.0.0 port 22.
Oct 1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: Server listening on :: port 22.
Oct 1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: Generating 1024 bit RSA key.
Oct 1 15:59:52 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1057]: RSA key generation complete.
Oct 1 08:59:54 buildbot-master125.bb.releng.usw2.mozilla.com ntpdate[1071]: step time server 10.26.75.40 offset -25199.406582 sec
Oct 1 08:59:54 buildbot-master125.bb.releng.usw2.mozilla.com ntpd[1078]: ntpd 4.2.6p5@1.2349-o Sat Nov 23 18:21:48 UTC 2013 (1)
# some yum changes, expected ?
Oct 1 11:25:25 buildbot-master125.bb.releng.usw2.mozilla.com puppet-agent[7295]: (/Stage[packagesetup]/Packages::Setup/Packages::Yumrepo[auditd]/File[/etc/yum.repos.d/auditd.mirrors]/content) content changed '{md5}7553e54b269d41bd471e123960ba3552' to '{md5}868268ad807533b8fa04b6cadcba865f'
Oct 1 11:25:25 buildbot-master125.bb.releng.usw2.mozilla.com puppet-agent[7295]: (/Stage[packagesetup]/Packages::Setup/Packages::Yumrepo[collectd]/File[/etc/yum.repos.d/collectd.mirrors]/content) content changed '{md5}1997e1fe0a4866179226df2df4267f3d' to '{md5}c3e10da0deb0f8aa08018a4b421b3eb0'
[and several more]
Oct 1 11:25:51 buildbot-master125.bb.releng.usw2.mozilla.com puppet-agent[7295]: (/Stage[main]/Main/Node[buildbot-master125.bb.releng.usw2.mozilla.com]/Buildmaster::Buildbot_master::Mozilla[bm125-tests1-linux64]/File[/builds/buildbot/tests1-linux64/master/BuildSlaves.py]/content) content changed '{md5}c0af2b3664af27cb7e5b0b6dc1a19281' to '{md5}853ba9241e9a35b6dd82e9340e825c0e'
# another spontaneous reboot ? NB time discontinuity
Oct 1 13:14:55 buildbot-master125.bb.releng.usw2.mozilla.com collectd[1187]: ethstat plugin: No stats available for eth0
Oct 1 13:19:55 buildbot-master125.bb.releng.usw2.mozilla.com collectd[1187]: ethstat plugin: No stats available for eth0
Oct 1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com rsyslogd: [origin software="rsyslogd" swVersion="7.6.7" x-pid="839" x-info="http://www.rsyslog.com"] start
...
# some disk repairs
Oct 1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): INFO: recovery required on readonly filesystem
Oct 1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): write access will be enabled during recovery
Oct 1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): orphan cleanup on readonly fs
Oct 1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): 2 orphan inodes deleted
Oct 1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): recovery complete
Oct 1 20:28:38 buildbot-master125.bb.releng.usw2.mozilla.com kernel: EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: ...
# another time jump, are we cleaning up unused rsa host sig, so we regenerate every boot ?
Oct 1 20:28:54 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1047]: Generating 1024 bit RSA key.
Oct 1 20:28:54 buildbot-master125.bb.releng.usw2.mozilla.com sshd[1047]: RSA key generation complete.
Oct 1 13:28:56 buildbot-master125.bb.releng.usw2.mozilla.com ntpdate[1061]: step time server 10.26.75.40 offset -25199.270760 sec
Reporter | ||
Comment 3•9 years ago
|
||
So this machine looks pretty sick. It may just be the underlying AWS host (although nothing in EC2 console > Events), and we could shut it down and hope it starts up elsewhere. Given the potential for disk corruption maybe the best option is to terminate and recreate the instance.
Status: disabled in slavealloc, downtimed in nagios until mid-morning EDT, left turned on.
Blocks: 1205409
Assignee | ||
Comment 4•9 years ago
|
||
I'll terminate it and regenerate it tomorrow.
Assignee | ||
Comment 5•9 years ago
|
||
new master is up, renabled in slavealloc
Assignee | ||
Comment 6•9 years ago
|
||
I don't know why but is saying it needs a reboot after puppet again. (REBOOT REQUIRED: reboot_after_puppet) This messages was not there before. I've disabled in in slavealloc again. It didn't take any jobs today.
Assignee | ||
Comment 7•9 years ago
|
||
I ran puppet manually and rebooted it and now buildbot it up. The only thing puppet updated was the /etc/motd to say that it didn't need to be rebooted. Will keep an eye on it, doesn't show any jobs running yet but machines are attached to it.
Assignee | ||
Comment 8•9 years ago
|
||
looks like it is running jobs and happy
Assignee: nobody → kmoir
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•7 years ago
|
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•