Closed Bug 896948 (buildbot-master79) Opened 11 years ago Closed 11 years ago

buildbot-master79 problem tracking

Categories: Infrastructure & Operations Graveyard :: CIDuty, task, P2
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: nthomas; Assigned: rail
Whiteboard: [buildduty][buildslaves][capacity]
Description

First nagios alert:

    Tue 02:20:58 PDT [472] buildbot-master79.srv.releng.usw2.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

Disabled in slavealloc to get slaves on the other five mac test masters.
Reporter
Comment 1 • 11 years ago
* Confirmed unreachable via ssh
* No events found in the console for usw2
* AWS instance monitoring detected a reachability failure at 2013-07-23 02:15 GMT-0700
* The console says this under the Status tab: "We are actively trying to resolve this issue. You can try one of the following options: Stop and start the instance (if EBS-backed AMI). Terminate the instance and launch a replacement (if instance-store backed AMI). Wait for us to resolve the issue."
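For reference, the same status information can be pulled from the EC2 API instead of the console. A minimal sketch using boto3 (the current AWS SDK for Python; releng tooling of this era used the original boto, but the call is equivalent) with a hypothetical instance ID:

    # Minimal sketch: query EC2 status checks for a suspect instance.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")  # usw2

    resp = ec2.describe_instance_status(
        InstanceIds=["i-0123456789abcdef0"],  # hypothetical ID for buildbot-master79
        IncludeAllInstances=True,             # also report stopped/impaired instances
    )
    for status in resp["InstanceStatuses"]:
        print(status["InstanceId"],
              "system:", status["SystemStatus"]["Status"],    # e.g. "impaired"
              "instance:", status["InstanceStatus"]["Status"])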
Reporter
Comment 2 • 11 years ago
Attempted to stop the instance using the console; it went into the stopping state but made no further progress after a few minutes, so I forced a stop by using Stop again. It is still stuck in the 'stopping' state. Over to rail as the resident expert on these VMs. The obvious options at this point are: opening a support ticket to try to recover the existing VM, trying the API to see if it's more responsive or detailed than the console, or terminating the instance and recreating it.
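The API route mentioned above would look roughly like the sketch below (again boto3 with a hypothetical instance ID); Force=True is the API equivalent of the console's second Stop click, which skips the graceful OS shutdown:

    # Minimal sketch: force-stop an instance stuck in "stopping".
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    iid = "i-0123456789abcdef0"  # hypothetical

    ec2.stop_instances(InstanceIds=[iid], Force=True)

    # Block until EC2 actually reports the instance stopped.
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[iid])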
Assignee: nobody → rail
Reporter
Comment 3 • 11 years ago
The 29 builds that buildapi said were running on this master have already gone away, presumably to be rerun on other masters.
Assignee
Comment 4 • 11 years ago
It's back now. I'm going to investigate this issue a bit before re-enabling it in slavealloc. For build/test machines we check for the so-called "impaired" status (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html), and it's quite common (5-10 times a day). We should probably report impaired-status events for other types of instances as well. FTR, it would also be useful to grab the console output (Get System Log) before you stop an instance.
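A sketch of what that impaired-status check plus the "Get System Log" grab could look like via the API, assuming boto3 and a hypothetical instance ID:

    # Minimal sketch: report impaired status / scheduled events, and capture
    # the console output while the instance is still in its failed state.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    iid = "i-0123456789abcdef0"  # hypothetical

    resp = ec2.describe_instance_status(InstanceIds=[iid], IncludeAllInstances=True)
    for status in resp["InstanceStatuses"]:
        for event in status.get("Events", []):
            print("scheduled event:", event["Code"], event.get("Description", ""))
        if status["SystemStatus"]["Status"] == "impaired":
            # Equivalent of "Get System Log" in the console; do this
            # before stopping, or the output is lost.
            out = ec2.get_console_output(InstanceId=iid)
            print(out.get("Output", ""))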
Assignee
Comment 5 • 11 years ago
Nothing interesting in the logs:

    Jul 23 01:48:25 buildbot-master79 puppet-agent[8709]: Finished catalog run in 59.09 seconds
    Jul 23 02:07:21 buildbot-master79 dhclient[775]: DHCPREQUEST on eth0 to 10.132.49.1 port 67 (xid=0x2db4efe1)
    Jul 23 02:07:21 buildbot-master79 dhclient[775]: DHCPACK from 10.132.49.1 (xid=0x2db4efe1)
    Jul 23 02:07:21 buildbot-master79 dhclient[775]: bound to 10.132.49.117 -- renewal in 1738 seconds.
    Jul 23 05:02:43 buildbot-master79 kernel: imklog 4.6.2, log source = /proc/kmsg started.
    Jul 23 05:02:43 buildbot-master79 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="863" x-info="http://www.rsyslog.com"] (re)start
    Jul 23 05:02:43 buildbot-master79 kernel: Initializing cgroup subsys cpuset

Syslog simply stops after 02:07 and picks up again at 05:02 with boot-time messages, so the instance recorded nothing about the failure itself. I enabled the master and I'm watching its logs.
Assignee
Comment 6 • 11 years ago
The master looks good so far. I filed bug 897002 to be more proactive here.
Assignee
Updated • 11 years ago
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Reporter
Comment 7 • 11 years ago
I did have a look at the console log: it was all the usual stuff you get in dmesg at boot time, then the login prompt, then a couple of lines which weren't obviously errors. I should have saved those last two, as they've been garbage-collected from my brain now.
Updated • 11 years ago
Product: mozilla.org → Release Engineering
Updated • 6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated • 4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard